THE WATER-JAR EINSTELLUNG TEST AS A 
MEASURE OF RIGIDITY 
EUGENE E. LEVITT 


Institute for Juvenile Research, Chicago 


The water-jar Einstellung test was 
first used experimentally by Karl 
Zener and Karl Duncker at the Uni- 
versity of Berlin in the 1920's. It 
was formally introduced into Ameri- 
can psychology by Luchins in 1942 
(34). The test consists essentially 
of a series of simple arithmetic prob- 
lems couched in terms of three water- 
jars each with a known maximum 
capacity. The S is required to manip-
ulate the jars so as to obtain a given
quantity in one of them. No other 
measure except the maximum ca- 
pacities of the jars can be used. The 
first group of problems presented to 
S can be solved by filling the largest 
jar, and then emptying it twice into
one of the smaller jars, and once into 
the other. If the jars are labeled A, 
B and C, the solution, which is the 
simplest available, follows the form 
B-A-2C. These problems are
called “set” or Einstellung problems
since their intent is to habituate the
S to the B-A-2C solution. An ex-
ample set problem is shown below;
the capacities are noted on each jar. 


1 This study was begun while the author 
was associated with the Iowa Child Welfare 
Research Station. During that time, its 
preparation was supported by Research Grant 
MH-301 from the National Institute of 
Mental Health, U. S. Public Health Service. 
The author is indebted to Dr. Milton Rokeach 
for advice and encouragement, and for a criti- 
cal reading of the first draft of this paper. 
However, the author is solely responsible for 
all general conclusions stated in this article. 


   50           81           7          Obtain 17


   A            B            C


Immediately after the set problem, 
the S is asked to solve a second group 
of problems for which the habituated 
solution suffices, but which also may 
be solved by a more direct method, 
usually A-C, or A+C. These prob-
lems are called critical, or test prob-
lems. Their original purpose was to
show the effect of the set engendered 
by the Einstellung problems. The 
use of the set solution for the critical 
problems was taken as evidence of 
the establishment of a set. An ex- 
ample critical problem is shown be- 
low. 

   21           51           9          Obtain 12


   A            B            C


In his 1942 monograph (34) Lu- 
chins added a third type of problem, 
the extinction problem. Extinction 
problems are amenable to solution 
by the direct A-C and A+C
method, but not by the indirect
B-A-2C method. The extinction
problem was originally conceived as 
an attempt to break the Einstellung; 
its success or failure was determined 
by a subsequent group of critical 
problems. An example extinction 
problem is shown below. 

   19           65           7          Obtain 26


   A            B            C

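The three problem types described above can be checked mechanically. The following is a minimal sketch, not part of the original test materials: the function names are hypothetical, and the jar capacities in the usage lines are illustrative values chosen to satisfy the solution forms stated in the text (only the extinction example, 19, 65, 7 with a target of 26, is taken directly from the example shown).

```python
def indirect_solution(a, b, c):
    """Quantity obtained by the habituated set method, B - A - 2C."""
    return b - a - 2 * c


def direct_solutions(a, c):
    """Quantities obtainable by the direct methods, A - C and A + C."""
    return {a - c, a + c}


def classify_problem(a, b, c, target):
    """Label a problem by which methods reach the target quantity."""
    by_indirect = indirect_solution(a, b, c) == target
    by_direct = target in direct_solutions(a, c)
    if by_indirect and by_direct:
        return "critical"      # either method works
    if by_indirect:
        return "set"           # only B - A - 2C works
    if by_direct:
        return "extinction"    # only the direct method works
    return "neither"


if __name__ == "__main__":
    # Capacities for the first two problems are assumed for illustration.
    print(classify_problem(50, 81, 7, 17))
    print(classify_problem(21, 51, 9, 12))
    print(classify_problem(19, 65, 7, 26))
```

The classification depends only on the two solution forms named in the text; any other route to the target is ignored in this sketch.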
Luchins has consistently regarded 
the water-jar test as a paradigm of 
human learning. In 1942 (34) and as 
recently as 1954 (40), he made clear 
that he believes that the primary 
value of the test lies in its educa- 
tional implications rather than in 
clinical use. Nonetheless he ap- 
parently developed some kind of 
clinical index of rigidity from the 
water-jars (37),² although he has
never published any norms or other 
developmental data. In 1948, Roke- 
ach (47) published the first account 
of the water-jar test as an experi- 
mental rigidity measure. He re- 
ported that there was a relationship 
between the number of short or di- 
rect solutions of the critical problems 
and scores on the California E scale. 
The relationship was in accord with 
certain theoretical considerations in 
which the ethnocentric individual is 
conceived of as having a generally
rigid personality structure. 

In Rokeach’s final form of the 
water-jar test, extinction problems 
were not used, and the set solution 
of the critical problem constituted 
the definition of rigidity. Rokeach 
added the control problem, one of 
Luchins’ variations (34), in his de- 
sign. A control problem is a critical 
which is administered prior to the 
presentation of the set problems. Ss 
who failed to solve the control prob- 
lem by the short method were elim- 
inated from the experiment. The 
expressed purpose of this problem 
was to equate the experimental Ss 
“in that they all demonstrate their 
ability to solve a critical problem by 
the simple method” (47, p. 263). 


2 Luchins’ manual has been out of print
since 1950, and only a limited number of
copies was ever printed (Cf. 29). The present
writer has so far been unable to obtain a copy
to determine whether developmental test data
are presented. The test form is probably
similar to those suggested by Luchins else-
where (35).


The publication of Rokeach’s findings
provided the impetus for a con-
siderable amount of experimental 
work with the water-jar test. It also 
gave rise to a controversy between 
Rokeach and Luchins (Cf. 36, 48) 
which has not yet been resolved. A 
center of the dispute was Rokeach’s 
exclusion of the extinction problem 
from his form of the test. Luchins 
has emphasized (36, 38, 39) that the 
extinction problem is the proper 
mechanism to tap rigidity since its 
solution requires a shift of method, 
while that of the critical does not. 
If rigidity is defined as “the inability
to change one’s set when the objec-
tive conditions demand it,” only the
extinction problem logically fulfills 
the requirements of the definition. 
The contention appears reasonable, 
especially in light of evidence (31) 
that Ss who use the short critical 
solution do not work any faster than 
those using the indirect solution. 
However, the controversy cannot be 
satisfactorily settled by argument 
alone. Logic is a secondary consider- 
ation when experimental investiga- 
tion is possible. 

The primary purpose of the present 
paper is to examine the validity of 
the water-jar test as a rigidity meas- 
ure by critically reviewing studies in- 
volving its use as such an index. It 
is hoped that the controversy be- 
tween Luchins and Rokeach will be 
resolved along the way. 

Since this review is concerned with 
the water-jar test as a rigidity meas- 
ure, a number of studies (6, 28, 34, 
41, 49, 57, 58) in which the test was 
manipulated to investigate the ef- 
fect on Einstellung, but in which it 
was not used as a rigidity measure, 
will not be considered here. 


TEST VARIATIONS 


The expression “water-jar test”
has been used thus far in its generic
sense, for there are actually a number
of different experimental forms of the
test. We can distinguish four basic
types which differ among themselves
with respect to the kinds of problems
which make up the form. The sim-
plest of these we have labeled the
Zener-Duncker form; it consists of a
series of set problems followed by a
series of criticals. The Luchins form
consists of sets, criticals, and extinc-
tions. A modification of this form
adds a series of criticals following the
extinctions. The Rokeach form has
a control critical problem followed
by sets and criticals. The Cowen
form consists of the modified Luchins
form preceded by the control critical.
A modification of this form has a set
problem inserted between the first
criticals and the extinctions. The
sequences of the various forms are
shown in Table 1. A few unlisted var-
iations have also been used occa-
sionally.


3 The forms are identified by the name of the
person who first used each one, as nearly as
can be determined from the literature.


TABLE 1


SEQUENCE OF TYPES OF PROBLEMS IN FORMS OF THE WATER-JAR TEST


Form                  Sequence of problem types

Zener-Duncker         sets, criticals
Luchins               sets, criticals, extinctions
Luchins (modified)    sets, criticals, extinctions, criticals
Rokeach               control, sets, criticals
Cowen                 control, sets, criticals, extinctions, criticals
Cowen (modified)      control, sets, criticals, set, extinctions, criticals


In addition to the different forms
of the test, there are also various
operational measures of rigidity which
are derived from the forms.

Most of the water-jar studies use
one or more of four primary meas-
ures. A fifth measure, time of solu-
tion of a problem, can be applied to
any of the four, though in practice
it is usually an extinction. These
measures are listed below in the form
in which a high score indicates rigid-
ity. The individual experimenter
may compute the opposite or non-
rigidity score. Each measure is fol-
lowed parenthetically by an appro-
priate abbreviation which will be
used to represent it in the remainder
of this paper.

1. Number of critical problems
solved by the long, or indirect
method (Cr).

2. Number of failures to solve ex-
tinction problems (Ex).

3. Number of long critical solu-
tions and number of extinction fail-
ures pooled into a single score (CrEx).

4. A specified number of long criti-
cal solutions and extinction failures
used as a multiple cutoff point to
distinguish a “rigid” group of Ss
(Cr+Ex).
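
The four primary measures listed above can be sketched as simple scoring functions. This is an illustrative sketch only: the record layout and function names are hypothetical, and the multiple cutoff is assumed here to require that both the Cr and the Ex thresholds be met, since the cutoff values themselves vary from study to study and are not specified in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Protocol:
    """One S's record on a form containing criticals and extinctions."""
    critical_long: List[bool]      # True where a critical was solved by the long method
    extinction_failed: List[bool]  # True where an extinction problem was not solved


def cr(p: Protocol) -> int:
    """Measure 1: number of criticals solved by the long, or indirect, method."""
    return sum(p.critical_long)


def ex(p: Protocol) -> int:
    """Measure 2: number of failures to solve extinction problems."""
    return sum(p.extinction_failed)


def cr_ex_pooled(p: Protocol) -> int:
    """Measure 3 (CrEx): long critical solutions and extinction failures pooled."""
    return cr(p) + ex(p)


def cr_plus_ex_rigid(p: Protocol, cr_cutoff: int, ex_cutoff: int) -> bool:
    """Measure 4 (Cr+Ex): multiple cutoff; both thresholds assumed necessary."""
    return cr(p) >= cr_cutoff and ex(p) >= ex_cutoff


if __name__ == "__main__":
    s = Protocol(critical_long=[True, True, False], extinction_failed=[True, False])
    print(cr(s), ex(s), cr_ex_pooled(s), cr_plus_ex_rigid(s, 2, 1))
```

In every case a high score indicates rigidity, as in the list above; a nonrigidity score would simply count the complementary outcomes.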

Considering combinations of form 
and measure, nine different opera- 
tional definitions of rigidity (exclu- 
sive of type of administration) based 
on the water-jar test have been used 
in correlational studies. Two of
these definitions involve time meas-
ures. Since the number is small,
these will be included with their re-
spective form-measure combination
in compiling results, leaving only
seven definitions.

The volume of rigidity studies is
too limited to furnish a conclusive
evaluation of the predictive validi-
ties of so large a number of defini-
tions. However, there are sufficient 
studies to permit at least an inspec- 
tional comparison. To facilitate this 
comparison, thirty-one investigations 
of the relationship between the water- 
jar test and criterion indices have 
been classified according to signifi- 
cance of results. A study is classified 
as positive if more than 75 per cent 
of the reported correlations were sig- 
nificant at the .05 level or beyond. 
If less than 25 per cent were signifi- 
cant, it is classified as negative. The 
remaining studies are considered am- 
biguous. 
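
The classification rule just stated can be written out as a short function. This is a sketch under the rule as given in the text (more than 75 per cent significant is positive, less than 25 per cent is negative, everything else is ambiguous); the function name is hypothetical.

```python
def classify_study(n_significant: int, n_reported: int) -> str:
    """Classify a study by the share of its correlations significant at .05."""
    if n_reported <= 0:
        raise ValueError("a study must report at least one correlation")
    share = 100.0 * n_significant / n_reported
    if share > 75.0:
        return "positive"
    if share < 25.0:
        return "negative"
    return "ambiguous"


if __name__ == "__main__":
    # One significant analysis of three falls between the two cutoffs.
    print(classify_study(1, 3))
    # A single nonsignificant correlation makes a study negative.
    print(classify_study(0, 1))
```

Note that a study with exactly 75 or exactly 25 per cent of its correlations significant falls in the ambiguous class under this reading of the rule.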

A breakdown of the significance 
of results as a function of form, meas- 
ure, and method of administration, 
each independently, is shown in 
Table 2. A similar breakdown ac- 
cording to combinations of form and 
measure is given in Table 3. Studies 
making up the frequency in each 
category of Table 3, and in the form 
analysis of Table 2, are listed paren- 
thetically alongside the frequency. 
The frequencies in both tables are 
obviously too small to warrant a 
statistical analysis, but inspection 
suggests that no particular form, 
measure, type of administration,⁴ or
combination of form and measure is 
superior to any of its fellows. This 
cautious conclusion derives addi- 
tional support from the fact that two 
of the three positive experiments us- 
ing the Rokeach form (53, 54) stem 


4 Experience with the WJT suggests that it
is quite sensitive to various conditions of ad- 
ministration, especially the instructions to 
Ss. However, these conditions are not speci- 
fied in many of the studies, so that evaluation 
of their effects is not worth attempting. 


TABLE 2


BREAKDOWN OF RESULTS ACCORDING TO FORM,* MEASURE AND ADMINISTRATION


                              Frequency of
                   Positive   Ambiguous   Negative
                   Results    Results     Results                         Total

Form
  Zener-Duncker    ...        ...         3 (20, 43, 59)                  ...
  Luchins          ...        ...         7 (2, 4, 23, 26, 30, 32, 39)    ...
  Rokeach          ...        ...         3 (22, 27, 35)                  ...
  Cowen            ...        ...         3 (14, 19, 51)                  ...
  Total            5          10          16                              ...

Measure
  Cr               ...        ...         ...                             ...
  Ex               ...        ...         ...                             ...
  CrEx             ...        ...         ...                             ...
  Cr+Ex            ...        ...         ...                             ...
  Total            ...        ...         ...                             ...

Administration
  Group            1          ...         ...                             ...
  Individual       1          ...         ...                             ...
  Not Stated       3          ...         ...                             ...
  Total            5          10          ...                             ...


* The bibliographic numbers of the studies in each category of the form-breakdown are shown parenthetically.

† One is a time measure.
from a single investigation (52). This
fractionated dissertation also swells
the positive results category for the
Cr measure, and for “Not Stated”
under type of administration.

In view of the data of Tables 2
and 3, it seems reasonable as well as
expedient to regard the various oper-
ational definitions based on the
water-jar tests as equivalent. Test
variations will hence be ignored in the
discussions in the remainder of this
paper, and results from different in-
vestigations will be pooled, when
necessary, regardless of differences in
form and measure.


5 This does not mean that the forms are
equivalent in every sense. Means and vari-
ances furnished by the forms are not neces-
sarily the same. The equivalence is one of
predictiveness (or lack of predictiveness)
rather than of descriptive data.


TABLE 3


BREAKDOWN OF RESULTS ACCORDING TO COMBINATIONS OF FORM AND MEASURE*


                              Frequency of
                   Positive   Ambiguous     Negative
Combination        Results    Results       Results                  Total

L (Cr)             0          0             3 (1, 23†)               3
L (CrEx)           1 (42)     2 (17, 42)    2 (4, 26)                5
L (Ex)             1 (...)    1 (5)         4 (23, 30‡, 32, 42)      6
L (Cr+Ex)          1 (50)     0             2 (2, 39)                3
  Total            ...        ...           11                       ...

C (Cr)             ...        ...           ...                      ...
C (CrEx)           ...        ...           ...                      ...
C (Cr+Ex)          ...        ...           ...                      ...
  Total            ...        ...           ...                      ...

Z-D (Cr)           ...        ...           3 (20, 43‡, 59)          ...
R (Cr)             ...        ...           3 (22, 27, 35)           ...

Over-all Total     ...        ...           20                       ...


* L = Luchins form, C = Cowen form, Z-D = Zener-Duncker form, R = Rokeach form.
The bibliographic numbers of the studies providing the frequency are shown parenthetically.

† ... separate Cr measures in th ...


Criticals vs. extinctions. The data
in Tables 2 and 3 also provide a


means of evaluating Luchins’ con- 
tention that the Ex problem rather 
than the Cr is the proper unit for the 
measurement of rigidity. Only one 
of six instances of the use of Ex fur- 
nished a significant relationship, the 
lowest percentage of any of the four 
measures. It is represented by a sin- 
gle chi square based on a small group 
of Ss. If we include the Cr+Ex meas-
ure⁶ the ratio becomes two of ten in-
stances, or 20 per cent compared with 
18 per cent for Cr and 25 per cent for 
CrEx. A total of three out of twelve 
individual correlational analyses us- 
ing Ex and Cr+Ex measures at- 
tained the .05 level of significance, 
and one of these (50) is a questionable 
result. As we shall see, this propor- 
tion is about the same as that of sig- 


6 Luchins (39) has recently stated that the 
Cr+Ex measure should be a satisfactory 
rigidity index. 
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nificant correlational analyses with all 
forms (see p. 352, below). Evidently, 
the extinction problem is no more pre- 
dictive than the critical problem 
when both are evaluated against in- 
dependent criteria. 

The experiments of Guetzkow (23)
are usually cited as experimental evi-
dence that the Cr and Ex are differ-
ent measures. He reported that men
and women Ss perform similarly on
the Cr’s, but that a significantly
higher proportion of men solved the
Ex. On this basis, Guetzkow sug-
gests that there are two different proc-
esses and causal factors involved;
the Cr is concerned with acquisition
of set, while the Ex is involved with
surmounting the set. The data them-
selves offer no real basis for such a
notion. They merely show the dif-
ferential performance of the sexes.
Seventy-eight per cent of the Ss who
used the short Cr solution also solved
the Ex. Guetzkow feels that this
lends weight to his different process
idea since there was no sex difference
in the 78 per cent. This is indeed a
peculiar conception. If a fair propor-
tion of the sample manifests both
types of behavior, then there is likely
to be a substantial correlation be-
tween solutions to the two kinds of
problems. A correlation of .53 be-
tween the two problems has been re-
ported elsewhere (32). Such results
can hardly lead one to conclude that
there is “a clear and verified distinc-
tion” between the processes involved
in solving the two types of problem.
Furthermore, median test analyses of
Harris’ data (24) show that when
time of solution is the measure, there
are no differences between males and
females on either problem. In any
event, the data of the correlational
analyses indicate that the Ex is no
better than the Cr as a unit for the
measurement of rigidity, no matter
what other hypotheses one might
care to entertain.


THE WATER-JAR TEST AS A 
RIGIDITY MEASURE 


In addition to suggesting the 
equivalence of the sundry definitions 
based on the water-jar test, the data 
of Tables 2 and 3 also indicate that 
the Water-jar test (WJT) lacks pre- 
dictive validity. Sixteen of the thirty- 
one studies report negative results. 
An additional ten have ambiguous 
findings. Only five can be considered 
as supporting the claims for the test 
as a rigidity measure, and it must be 
remembered that two of these are off- 
shoots of a single dissertation (52). 

The number of individual correla- 
tional analyses varies considerably 
among the 31 studies. Eleven studies 
have only a single correlation, but 
some report as many as 30 or more. 
There is a total of 202 individual 
analyses of the relationships between 
the WJT and other tests. These in- 
volve 111 measures derived from 66 
different instruments, exclusive of 
tests of intelligence. Of the 202 cor- 
relations, 151, or 74.75 per cent do 
not reach significance at the .05 level 
or beyond. If we allow for the ten
correlations which are probably sig-
nificant by chance alone, the percent-
age of insignificant correlations rises
to about 80, roughly the same as the
percentage of negative and ambigu-
ous studies shown in Table 2.

The high proportion of insignifi-
cant correlations, like the similar pro-
portion of negative and ambiguous
studies, indicates that the WJT lacks
criterion validity. It is pertinent to
inquire, however, whether any par-
ticular criterion test has been found
to be more consistently related to the
WJT than others. Evaluation is
hindered by the fact that only 12 of
the 66 criterion tests have been in-
vestigated in replicated studies. None-
theless, it may be revealing to ex-
amine the relationship of the WJT
and individual criteria.
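
The percentages reported above for the 202 correlational analyses can be reproduced with a few lines of arithmetic. The sketch below uses only figures given in the text; the variable and function names are hypothetical, and the chance allowance is taken to be 5 per cent of the analyses, rounded to a whole number, which matches the "ten correlations" mentioned.

```python
TOTAL_ANALYSES = 202      # individual correlational analyses
NONSIGNIFICANT = 151      # analyses not reaching the .05 level
ALPHA = 0.05


def percent(part: float, whole: float) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# Raw share of nonsignificant analyses.
raw_pct = percent(NONSIGNIFICANT, TOTAL_ANALYSES)

# Analyses expected to reach significance by chance alone at the .05 level.
expected_by_chance = round(ALPHA * TOTAL_ANALYSES)

# Share of nonsignificant analyses after allowing for chance results.
adjusted_pct = percent(NONSIGNIFICANT + expected_by_chance, TOTAL_ANALYSES)

# Share of the 31 studies classified as negative (16) or ambiguous (10).
studies_pct = percent(16 + 10, 31)

if __name__ == "__main__":
    print(round(raw_pct, 2), expected_by_chance,
          round(adjusted_pct, 1), round(studies_pct, 1))
```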

The California scales. Of the 12 
tests in replicated researches, the Cal- 
ifornia E and F scales have been the 
objects of most attention, probably as 
a joint result of the use of the E scale 
in Rokeach’s provocative study (47)
and the popular theory linking rigid- 
ity and the antidemocratic person- 
ality. In view of the usually high cor- 
relation between the two scales, they 
will be considered as one measure for 
the analysis of this section. 

There are nine studies (4, 8, 17, 20,
27, 31, 32, 47, 59) in which either the
E or the F scale has been used as a
criterion measure. A total of 1,088
Ss have been involved. The results
of these investigations are summar-
ized in Table 4.

There are 18 individual correla-
tional analyses in the studies.⁷ Fif-
teen of these are correlation coeffi-
cients of various sorts while three are
t tests of differences between mean
WJT scores for ethnocentric and
nonethnocentric groups, or between
E- or F-scale means for rigid and non-
rigid groups on the WJT. Of the 18
correlational analyses, only five reach
the .05 level of significance or be-
yond. Assuming that the various co-
efficients are equivalent estimates,
the average of the 15 is .04. If we as-
sume further that an insignificant t
equals a coefficient of zero, and that
a significant t equals a coefficient of
.40, the average correlation becomes
.07. Evidently the WJT and the Cal-
ifornia scale of the antidemocratic per-
sonality are not related indices.


7 Results for a group of children in
Rokeach’s study (47) are not included here in
the interests of uniformity of research popula-
tions.


TABLE 4


REPORTED RELATIONSHIPS BETWEEN THE WJT AND THE CALIFORNIA SCALES OF THE
ANTIDEMOCRATIC PERSONALITY†


Column headings: Correlational Technique; Special Experimental Conditions
(none, after stress, reward incentive); Result.

Result entries: -.15, .34*, -.21, .00, .40**, .06, .06, .10, +significant,
nonsignificant, -.13, .09, -.18, +significant, -.03, .07, .30**

Average of correlation coefficients                                  .04
Average of all analyses (see text)                                   .07
Average of all analyses of data obtained under special conditions    .05


* Significant at the .05 level.
** Significant at the .01 level.
† The signs of some correlations have been changed so that a positive correlation always
indicates that WJT rigidity scores and authoritarianism scores vary in the same direction.
‡ Only Ss with WJT scores greater than zero included.
§ A group of female Naval officers.

The data in Table 4 also provide a
means of testing Brown’s (8) hy-
pothesis that a significant relation-
ship between authoritarianism or
ethnocentrism and the WJT is a func-
tion of stressful or ego-involving ex-
perimental conditions. There are
seven correlations (4, 8, 20, 31, 59)
computed from the data of 509 Ss
who performed in this type of atmos-
phere. Only two of these are signifi-
cant, one being furnished by Brown
himself. The average using the crude
conversion from t to r as in the previ-
ous paragraph is only .05. The aver-
age for the remaining 11 correlations
is .08. (The respective averages with-
out the converted t’s are .06 and .02.)
The data thus do not support
Brown’s hypothesis, even when his
own results are included among them.
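
The crude conversion from t to r used in these averages can be sketched as follows. The function name and the sample values in the usage line are hypothetical and are not taken from Table 4; only the conversion rule itself (a significant t counted as .40, an insignificant t as zero) comes from the text.

```python
from typing import Iterable, List


def average_with_converted_t(coefficients: Iterable[float],
                             t_is_significant: Iterable[bool],
                             significant_value: float = 0.40) -> float:
    """Average correlation after converting each t test to a coefficient.

    A significant t is counted as `significant_value` and an insignificant
    t as zero; the converted values are pooled with the reported
    coefficients before averaging.
    """
    converted: List[float] = [significant_value if sig else 0.0
                              for sig in t_is_significant]
    pooled = list(coefficients) + converted
    if not pooled:
        raise ValueError("no analyses to average")
    return sum(pooled) / len(pooled)


if __name__ == "__main__":
    # Illustrative values only: three coefficients and two t tests.
    print(average_with_converted_t([0.10, -0.10, 0.30], [True, False]))
```

Because every converted t is either zero or .40, the pooled average is sensitive to the choice of .40; a different assumed value would shift the result in proportion to the number of significant t tests.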

Further analysis of the results in-
dicates that stressful experimental
conditions also do not increase the
predictiveness of the WJT for meas-
ures other than the E and F scales.
Of 28 such correlational analyses
only one is significant at the .05
level. Apparently, the data will not
bear out a conclusion that the WJT’s
lack of criterion validity is a func-
tion of experimental circumstances.

The Rorschach. Four studies have 
been concerned with the relationship 
between the WJT and various Ror-
schach indices. A total of 25 such
indices have been used in the studies; 
however, only 12 will be found in 
replicated studies. Of these, 11 (high 
F%, low R, high A%, high F+%, 
low FC, high W, high Dd, low M, low 
FY, low content range, and slow reac- 
tion time for first response to each 
card) comprise the Fisher Rigidity 
Score. In the Applezweig (4) and 
Katz (27) reports, the Fisher score is 
given as a unit. Both report nonsig- 
nificant relationships with the WJT. 
Cowen and Thompson (15) used 8 of
Fisher’s 11 measures with child Ss,
and found that 4 (R, content range,
time of first response, and F+%) were
related to the WJT.

Both Katz and Cowen and Thomp- 
son found a significant relationship 
between the WJT and Af+C. Of 8 
indices, this is the only one found to 
be related by Katz. Those which he 


TABLE 5


SUBJECT LOSS DUE TO EXPERIMENTAL STANDARDS REPORTED IN SIXTEEN WATER-JAR
TEST STUDIES WITH AN ORIGINAL TOTAL OF 2,385 SUBJECTS


                                                   Percentage of Ss     Percentage of Ss
                                      Number of    Lost in Studies      Lost from Total
Standard                              Ss Lost      Using This Standard  Sample

Set solution of a requisite number
  of set problems                     221          24.97                ...
Short (or long) control solution      181          20.43                ...
Arithmetic accuracy                   119          22.54                ...
Pooled standards*                     113          24.84                ...
Total Loss                            634          ...                  26.58


* A pooled loss due to multiple standards is reported. Included in this figure are 32 Ss whose loss is not explained.
reports as unrelated include the
Reichard Prejudice Score, the Gibby 
Stability Score, and judges’ ratings 
of inflexibility and emotional con- 
struction. Cowen and Thompson 
used 18 different indices of which 8 
correlated significantly with the WJT. 
However, the fact that their sample 
consisted of children tends to vitiate 
comparisons with the other studies. 

Total R was reported to be related 
to the WJT by Cowen and Thomp- 
son and by Eriksen and Eisenstein 
(17). Katz, however, failed to con- 
firm these findings. 

Other measures in replicated studies. 
The relationship between the Wechs- 
ler-Bellevue Similarities subtest and 
the WJT has been investigated by 
Luchins (39) and Horwitz (26). The 
former reports a negative result while 
the latter gives a significant correla- 
tion of —.30. 

Maltzman, Fox, and Morrisett
(42), and French (20) each made
two analyses of the WJT and the
Taylor Anxiety Scale. Of the four
analyses, only one of Maltzman’s is
significant.

The alphabet maze has been used 
in three studies. Cowen, Wiener, and 
Hess (16) report a significant r of .42, 
while the correlation in Vallance’s 
study (59) is insignificant. Bakan 
(5) reports a significant r of .26, but 
it is the result of averaging four in- 
dividual coefficients which are not in- 
dependent, and is therefore a biased 
estimate. Only one of the four indi- 
vidual coefficients is significant. 

Katz (27) and Schmidt, Fonda, 
and Wesley (50) have examined the 
Wesley Rigidity Scale in relation to 
the WJT. Katz found an insignifi- 
cant relationship, while the latter ex- 
perimenters claim to have found a 
relationship. However, their data 
analysis is questionable. They di-
vided a group of Ss into three sub-
groups on the basis of WJT scores,
minimally, equivocally, and maxi-
mally rigid, and compared mean
rigidity scale scores by t tests. Only
one of the three t tests, that between
the minimally rigid and the maxi-
mally rigid groups, was significant.
This t test furnishes the basis for the
claim that a relationship exists. In
designs of this type, the proper pro-
cedure is first to compute an analysis
of variance of the scores of all three
groups. If a significant F does not re-
sult, significant individual t tests can-
not be regarded as indicating real dif-
ferences. The over-all F for the
three groups in the Schmidt et al.
study is 2.89, which falls short of the
.05 level. Therefore the t test upon
which the relationship claim is based
is specious, and the study must be
regarded as having essentially nega-
tive results.
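
The procedure recommended above, an over-all analysis of variance before any pairwise t tests, can be sketched as follows. This is an illustration under stated assumptions: the function names and the sample scores are hypothetical, and the critical value of F must be supplied by the user from a table for the appropriate degrees of freedom, since the text reports only the obtained F of 2.89.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def one_way_f(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way analysis of variance."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of group means about the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores about their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


def pairwise_tests_allowed(f_ratio: float, f_critical: float) -> bool:
    """Individual t tests are interpreted only if the over-all F is significant."""
    return f_ratio >= f_critical


if __name__ == "__main__":
    # Hypothetical rigidity scale scores for three WJT subgroups.
    scores: List[List[float]] = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
    f_ratio, df_b, df_w = one_way_f(scores)
    print(f_ratio, df_b, df_w)
    # 3.15 is an assumed critical value used only for illustration.
    print(pairwise_tests_allowed(2.89, 3.15))
```

The gate in `pairwise_tests_allowed` is the whole point of the procedure: a single significant t among several comparisons carries no weight unless the over-all F has first reached significance.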

Oliver (45) and Horwitz (26) found 
no relationship between the WJT 
and mirror writing of letters and 
words. 

Horwitz (26) and Eriksen and 
Eisenstein (17) related the WJT to 
performance on reversible figures.
Both studies used the reversible 
staircase and the Necker cube. Hor- 
witz also used the reversible profile. 
In the Eriksen-Eisenstein work, per- 
formance on the two figures was 
grouped into a single score. None of 
the correlations are significant except 
for the Necker cube in the Horwitz 
experiment. However, the coeffi- 
cient of .30 is in the opposite direction 
from what would be expected if the 
WJT was measuring rigidity. 

Applezweig (4) and Horwitz (26) 
computed correlations between the 
WJT and the Hidden Words Test. 
None of four individual correlation 
coefficients is significant. 

Three different measures derived 
from Maier’s two-string problem 
have been used as criterion indices.
Adamson and Taylor (1) obtained
one significant chi square of three in
attempting to relate the WJT to a
“functional fixedness” score. Guetz-
kow (23) reports two insignificant
analyses, one based on a “stereo-
typy ratio” and the other on correct-
ness of solution of the two-string
problem.

The relationship between the WJT
and level of aspiration measures on
the Rotter Board has been investi-
gated by Horwitz (26) and Harway
(25). The latter employed 11 differ-
ent measures, and analyzed differ-
ences in both means and variances
for a rigid and nonrigid group differ-
entiated by the WJT. Two of Har-
way’s measures, number of unusual
shifts of estimate (i.e., up after fail-
ure, down after success) and the ab-
solute discrepancy between estimates
from trial to trial, were replicated
by Horwitz. His correlations are
both insignificant, but Harway’s t
for the second measure reaches the
.05 level of significance. Of Harway’s
11 measures, four show both signifi-
cant mean and variance differences.
In addition to absolute discrepancy
between estimates, real differences
were found for the variation of esti-
mates from the mean estimate, the
average magnitude of the shifts,
and the variation of the magnitude
of shift from the mean. Harway also
derived his 11 measures from the
WJT itself and from the Hidden
Words Test. There were only two
significant mean differences for the
WJT, and only one for the Hidden
Words, a total of 7 of 33 mean dif-
ferences derived from the three tests.
However, there were five significant
variance ratios from the WJT, and
seven from the Hidden Words, a
total of 16 for the three instruments.
Of the seven mean differences, six
also have significant variance dif-
ferences. Variance differences are
difficult to interpret, especially in
the absence of accompanying mean
differences. Certainly there is no
particular theory or hypothesis con-
cerning personality rigidity which
would easily encompass variance dif-
ferences. Such differences may have
a real theoretical meaning, but it
would be unduly optimistic to say
that they reflect favorably on the
validity of the WJT as a rigidity
measure, especially since the same
statement obviously cannot be made
for the mean differences.

There are a number of tests used 
in WJT investigations which are not 
found in replicated studies, but which 
may be grouped under certain usual 
headings. Four of these are what are 
commonly regarded as tests of con- 
cept formation. Forster, Vinacke, 
and Digman (19) found no relation- 
ship between the Vigotsky and the 
WJT, and between another sorting 
test and the WJT. Katz (27) found 
a similar insignificance for the Wis- 
consin Sorting Test. Solomon's (54) 
“organization of biology concepts” 
scale did relate to the WJT at the .05 
level. 

Several investigations deal with 
the relationship between the WJT 
and emotional adjustment. Cowen 
and Thompson (15) found no rela- 
tionship between the WJT and the 
Bell Adjustment Inventory or the 
California Test of Personality, though 
again, the use of a child population 
precludes comparison with other stud- 
ies. Ainsworth (2) derived an ad- 
justment score from a security-inse- 
curity inventory, which turned out 
to be unrelated to the WJT. Meer’s
(43) results with the Maslow Secu-
rity-Insecurity index were also nega-
tive. Horwitz (26) and Levine (30)
compared groups of normals and
psychiatric patients. In neither case
were any significant differences in
WJT scores obtained.

Three studies made use of abstract
reasoning tests. Insignificant correla-
tions with the WJT were reported by
Forster et al. (19) for the Duncker
reasoning tasks: the matchbox, cork,
X ray, and “13.” Solomon (55) found
that the WJT was related to re-
sponses to only one of four science
questions after a laboratory course
designed to overcome the common
misconceived answers. Sivers (51)
apparently did not find a relation-
ship between the WJT and Form A
of the Abstract Reasoning Test; the
data are not presented in sufficiently
clear form to be certain.

Two experiments dealt with instru- 
ments which may be thought of as 
measuring perceptual intolerance of 
ambiguity. Relationships between 
the WJT and the Mooney-Ferguson
Closure Tests I and II, the Frenkel-
Brunswik Changing Figures Test,
and the Levy Design Preference Test
were computed by French (20) for
two groups. None of the eight coefficients
is significant. Eriksen and
Eisenstein (17) found that the WJT
was related to “availability of hypotheses,”
i.e., the number of guesses
as to the identity of objects shown
tachistoscopically at subrecognition
speeds.

Other perceptual tasks included 
the Anevl dots (4), Hidden Objects
(4), and speed of recognition of
tachistoscopically-presented words preceded
by erroneous expectancy (17).
Of seven correlations, only one, with
Hidden Objects, is significant. Two
of the other correlations of the WJT
with Hidden Objects are not significant.

8 The lack of clarity in no way reflects on
Sivers’ abilities. A study of the relationship
between the WJT and the ART was not part
of his design. The present writer has estimated
the degree of relationship from data
presented by Sivers for other purposes.

Motor and perceptual tasks. A number
of motor or perceptual-motor
tasks have been used in the WJT
studies. These include mirror writing,
word construction, figure similarities,
maze tracing, code deciphering,
arithmetic speed, etc. The AAF
Aviation Psychology Program Research
Report No. 5 (21) lists seven
“change of set” tests which fall into
this group. Three of the seven were
significantly related to the WJT, but
the highest of the three coefficients
was only .18. Three of five motor
tasks in Oliver's battery (45) were
related to WJT scores; the highest
coefficient was .25 for the Gottschaldt
figures. None of the three tasks used
by Horwitz (26) was found to be
related to the WJT. A contour-drawing
test (19) was also unrelated.

Miscellaneous measures. Relationships
reported in unreplicated
studies, or in studies which do not
fall into usual groupings, are of less
import in evaluating the WJT. However,
a number of such attempts are
listed here for purposes of completeness.

Eight Thurstone scales were administered
to a group of Ss also performing
on the WJT by Goodstein
(22). None of the correlations was
significant, the highest being only
.03. Goodstein also found a similar
absence of association for an anagrams
test, and for the Shipley-Hartford
Retreat Scale. Peer ratings were
reported to be unrelated by Vallance 
(59). Decision time as measured by 
the Festinger-Wapner test did not 
distinguish rigids and nonrigids, ei- 
ther among normals or psychiatric 
patients (30). Cowen (14) found 
that the WJT failed to discriminate 
high and low “negative self-concept”
scorers on the Brownfain Self-Rating
Inventory. 

On the positive side, 20 of 33 items 
in Solomon’s “aspects of scientific
method” scale (53) successfully separated
high and low scorers on the
WJT. Brown (8) found that the WJT 
was related to n Achievement under 
ego-involving conditions, but not in 
an ordinary experimental situation. 
Solomon (56) reported that stutterers 
tended to show more WJT rigidity 
than nonstutterers. 

Measures of intelligence. Luchins 
reported in his 1942 monograph (34) 
that there was no relationship be- 
tween WJT scores and intelligence. 
No correlation coefficients or other 
statistical demonstrations are pre- 
sented. He based his conclusion on 
the fact that differences in Einstel- 
lung effect varied only slightly among 
groups of different ages and educational
levels. Such variation as did
occur was attributed to “differences
in attitudes towards and interpretations
of their tasks and instructions,
rather than sheer differences in age
or educational level” (34, p. 19).
However, the fact that there is no 
correlation between mean scores of 
groups and intelligence does not pre- 
clude the possibility of significant 
correlations within groups. A more 
objective evaluation can be obtained 
from the results of 12 studies in which 
the relationships between the WJT 
and seven different measures of intel- 
ligence are reported. The Cowen 
and Thompson work (15) with chil- 
dren is again considered separately. 
They report no relationship with the 
Pintner General Abilities Test. 
Rokeach (47) found no association 
between either the Stanford-Binet 
or the Wechsler and the WJT in an 
adolescent group. Absence of rela- 
tionship is reported by Applezweig 
(4) for the Navy GCT, and by Horwitz
(26) for the Wechsler. Horwitz
administered only three subtests, 
Arithmetic, Comprehension, and 
Similarities. The total score is unre- 
lated to the WJT, but the latter two 
subtests have significant individual 
correlations with the WJT. Katz 
(27) found no relationship between 
the WJT and a composite score on 
Iowa Entrance Examination tests.
French (20) found a similar lack of 
correlation for the AFQT. 

Five studies involved the WJT 
and the ACE. Four of these (5, 7, 
43, 45) report significant negative re- 
lationships between the two (high 
rigidity, low intelligence). The re- 
maining study (53) did not find a 
relationship. Vallance (59) found low 
but significant correlations between 
the WJT and academic grades in en- 
gineering and navigation obtained 
by students at a Navy OCS. 

Again assuming that all correlation
coefficients are equivalent estimates,
the average correlation based on
1,218 Ss in nine studies is -.17.
Since no coefficient is reported by
Benedetti and Douglas (7), their
findings are not included. It may be
reasonably assumed that the inclusion
of their result would raise the
average coefficient to about -.20. A
small portion of the variance of WJT
scores is thus probably a function of
intelligence. However, this conclusion
should be viewed with caution
since the correlation of -.20 is
largely a result of relationships obtained
with the ACE, especially the
“Q” subtest. Other tests yielded
mostly insignificant results, though
almost all were in the right direction.
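
As a rough gauge of what "a small portion of the variance" amounts to, the squared correlation gives the proportion of variance that two measures share. The minimal sketch below applies that standard identity to the two average coefficients quoted above; the function name is illustrative and is not drawn from any of the studies reviewed.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance shared by two measures correlated at r."""
    return r * r


# The two average coefficients quoted in the text.
for r in (-0.17, -0.20):
    print(f"r = {r:+.2f} -> {shared_variance(r) * 100:.1f} per cent of variance")
```

On these figures intelligence would account for roughly 3 to 4 per cent of the variance in WJT scores, which is consistent with the cautious wording of the conclusion.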

We conclude that no particular
test or type of test, except tests of
intelligence, appears to be consistently
or clearly related to the WJT. The
conclusion must be tempered in light
of the multiplicity of instruments
used and the lack of replication. It is
most particularly applicable to the 
California scales, which have long 
been considered as a rigidity criterion 
for performance measures. It ap- 
pears to be more or less applicable to 
the Rorschach, to tests of concept 
formation, to emotional adjustment, 
to reasoning tests, and to various per- 
ceptual and motor tasks. 


FACTOR-ANALYSIS STUDIES 


Factor analyses of batteries of 
tests including the WJT were per- 
formed by Horwitz (26) and Oliver 
(45). Apparently such an analysis 
was intended for the “change of set”
tasks in the AAF program (21), but 
the plan was abandoned when only 
seven of the 28 r’s turned out to be 
significant, the largest being only .18. 

Horwitz’ battery included nine
tests. Separate analyses were done
for the normals and psychiatric patients.
The results in each instance
were much the same. A problem-solving
rigidity factor was heavily
loaded with IQ, leading Horwitz to
conclude that low intelligence is “an
important determinant in problem
solving rigidity” (26, p. 70). Horwitz
also derived a “strength of set”
measure from the WJT by interviewing
Ss to determine the method used
in solving the water-jar problems. He
grouped responses under four headings
ranging from those which tended
to establish the strongest set to those
which led to the weakest. The
“strength of set” measure was included
in another factor in which
arithmetic ability had a heavy loading.
Hence Horwitz concludes that
poor arithmetic skill accounts for the
establishment of a weak set in the
WJT.

Horwitz’ general conclusion is that
“the Einstellung tests appear with
strong loadings on the intelligence
factor but fail to cluster with any
of the other rigidity tests” (26, p.
97). His findings are weighted further
by his use of the Wechsler as an IQ
measure rather than the ACE.

Oliver included ten measures in his 
battery, of which five were motor or 
perceptual-motor tasks which he had 
developed, three were ACE subtests, 
and the last was the Gottschaldt fig- 
ures. Of the three factors extracted 
by Oliver, the WJT contributed only 
to General Reasoning Ability, a fac- 
tor composed mostly of the ACE 
tests. It had a slight negative weight- 
ing for the Disposition Rigidity fac- 
tor. Oliver concludes that if his Dis- 
position Rigidity factor is validly 
labeled, then the WJT does not meas- 
ure this characteristic. However, as 
in the Horwitz analysis, the WJT 
appears to be clearly involved with 
intelligence. 

The factor analyses of Horwitz 
and Oliver support the results of the 
correlational studies of the WJT and 
intelligence tests, and lend credence 
to the hypothesis that scores on the 
WJT are in part a function of intel- 
ligence. 


PERFORMANCE ON THE WJT 
UNDER STRESS 


A number of studies attempt to 
demonstrate the validity of the WJT 
as a rigidity measure without the use 
of a criterion test. The basic design,
reasoning, and intent of these studies
are relatively uniform. The hypothesis
under examination is that rigidity
will increase as a function of stress. 
The Ss who perform under conditions 
of ego-involvement, anxiety, frustra- 
tion, anticipated failure, and so forth, 
will manifest a greater frequency of 
long solutions to the water-jar prob- 
lems than individuals to whom the 
test is administered under nonstressful
circumstances. If the experimental
results are in accordance with
the hypothesis, it is customary for 
the experimenter to accept his re- 
sults as evidence that the WJT is a 
valid measure of rigidity. The logic 
of this conclusion will be discussed in 
a subsequent section of this paper. 
For the moment, we are concerned 
with results obtained in experiments 
of this nature. 

Investigations of the effects of 
stress on the WJT scores have been re- 
ported by Christie (10), Harris (24), 
Pally (46), and Cowen (12, 13). The 
studies of Christie and Harris are 
practically replicates; both used the 
same design and the same WJT measure,
time required to solve a single
Ex problem following a series of sets
and a Cr. Christie found that 15 frustrated
Ss took a mean of 157.66 seconds
to complete the Ex while a
like number of unfrustrated Ss required
only 69.87 seconds on the
average. The critical ratio of the difference
is reported as 2.04, with a p
of .02. The p value is obviously incorrect;
the t reaches only the .05
level for df=28. If a one-tailed test
was intended, it is not so stated, nor
would such a test be appropriate. In
addition, the ratio of the variances 
for the two groups is 3.61, which is 
significant beyond the .01 level, and 
contraindicates the use of a table of 
Student’s t for determining the p
value. If we apply the adjustment
for heterogeneous variance suggested
by Cochran and Cox (11), we find
that the t required for significance at
the .05 level is 2.14, and the differ- 
ence between Christie’s groups is not 
significant. On the other hand, the 
distributions of scores are skewed, 
a fact mentioned by Harris as lead- 
ing to his use of a log transformation 
of the data. If we apply a median 
test to Christie’s data, we obtain a 
chi square of 4.80, significant beyond 
the .05 level for 1 df. 
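
The Cochran and Cox adjustment referred to above can be sketched as follows. It approximates the critical value as a weighted average of the two groups' own critical t values (each on n - 1 df), with weights equal to each group's variance divided by its n. When the groups are of equal size the weights cancel, so the adjusted criterion is simply the tabled t for 14 df. The two variances below are placeholders chosen only to reproduce the reported ratio of 3.61; Christie's actual variances are not given in this review. The t quantile is obtained numerically so that the sketch needs only the standard library, and all function names are illustrative.

```python
import math


def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """CDF of Student's t, by Simpson's rule on the density over [0, |x|]."""
    if x == 0:
        return 0.5
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u: float) -> float:
        return const * (1 + u * u / df) ** (-(df + 1) / 2)

    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_critical(df: int, alpha: float = 0.05) -> float:
    """Two-tailed critical t, found by bisection on the CDF."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 2 * (1 - t_cdf(mid, df)) > alpha:
            lo = mid  # tail area still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2


def cochran_cox_critical(var1, n1, var2, n2, alpha=0.05):
    """Weighted average of the per-group critical values (weights s^2 / n)."""
    w1, w2 = var1 / n1, var2 / n2
    t1, t2 = t_critical(n1 - 1, alpha), t_critical(n2 - 1, alpha)
    return (w1 * t1 + w2 * t2) / (w1 + w2)


# Placeholder variances with ratio 3.61; two groups of 15 Ss as reported.
crit = cochran_cox_critical(3.61, 15, 1.0, 15)
print(round(crit, 2))   # adjusted criterion at the .05 level
print(2.04 >= crit)     # does the reported critical ratio reach it?
```

Under these assumptions the adjusted criterion agrees with the 2.14 cited in the text, and the reported ratio of 2.04 falls short of it.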

The facts are more or less reversed
for Harris’ data. Using the log transformation,
he finds that 18 Ss in the
stress group required 202.94 log seconds
to solve the Ex, while the nonstress
Ss needed only 55.56 log seconds
on the average. The t is given
as 2.59, which is significant at the .01
level. However, analysis of the raw 
data using a median test results in a 
chi square of 2.78 which falls short 
of the .05 level of significance. 
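
A median test of the kind applied here reduces to a chi square on a 2 x 2 table (group by above or below the combined median). The sketch below uses the continuity-corrected formula. The cell counts are assumptions: neither set is reported in this review, and they are simply splits that reproduce the chi squares quoted above, on the further assumption that Harris' nonstress group also numbered 18 (only the size of his stress group is stated).

```python
import math


def median_test_chi2(a: int, b: int, c: int, d: int) -> float:
    """Continuity-corrected chi square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def chi2_p_1df(x: float) -> float:
    """Upper-tail probability of chi square with 1 df."""
    return math.erfc(math.sqrt(x / 2))


# Hypothetical splits (above median, below median) for stress and nonstress Ss.
assumed_tables = {
    "Christie": (11, 4, 4, 11),   # two groups of 15
    "Harris": (12, 6, 6, 12),     # two groups of 18 (second n assumed)
}

for study, table in assumed_tables.items():
    chi2 = median_test_chi2(*table)
    print(study, round(chi2, 2), round(chi2_p_1df(chi2), 3))
```

With these assumed splits the first table exceeds the 3.84 required at the .05 level for 1 df and the second does not, which is the reversal described in the text.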

The interpretation of Christie's 
and Harris’ results is thus wide open, 
with the particular choice depending 
in large part upon which statistical 
analysis the interpreter wishes to 
credit. Certainly neither study merits
the unqualified citations which
it has received in later publications.

Pally’s study (46) is cumbersome, 
poorly reported, and difficult to eval- 
uate. He divided his Ss into four 
groups of 20 each, Groups A and B 
experiencing failure on tests preced- 
ing the WJT administration, Group 
C experiencing success, and Group 
D being neutral. The WJT had 10 
Cr's followed by an Ex. Once an S 
succeeded in solving a Cr by the short 
method, the experiment ceased for 
him. Pally proceeded to compute 
four measures for the groups: time 
required to solve the first Cr by the 
short method (if one was solved), 
the mean number of Cr’s solved, the 
number of Ss having to solve the Ex, 
and the mean time required for the 
Ex solution. It is obvious that these 
measures are not independent of one 
another. There is almost certain to 
be a marked relationship between 
the number of Cr’s attacked by the 
S and time involved in the first short 
solution. Similarly, the number of 
Cr's will be related to the number of 
Ss having to do the Ex, and so on. 
Hence the significance of at least the 
last three measures is likely to be unclear,
especially if the results are conflicting.

Pally notes that he analyzed the 
time for the first Cr scores and the 
number of Cr’s by an analysis of 
variance. However, no F ratios are 
reported. A table of p values based 
on t tests and chi squares is given,
but the over-all analyses are absent. 
The various p values show that 
Groups A and B did not differ on any 
of the four measures. The same is 
true of Groups C and D. Group A 
differed from Groups C and D on 
three measures each. Group B dif- 
fered from C and D on two measures 
each. In each instance, the measures 
providing significant differences are 
not the same for C and D. Both 
Group A and B differed from Group 
C on mean time for solution of the 
Ex, but neither differed significantly 
from Group D. However, these find- 
ings are not directly comparable to 
those of Christie and Harris since 
only 30 of Pally’s 80 Ss reached the 
Ex. 

Cowen (13) divided his Ss into
three groups of 25 each. One group
was subjected to “mild stress” prior
to the WJT administration, a second
group to “severe stress,” while the
third group was a control receiving
no stress. He recorded the number
of long solutions, the time of response
to all problems, and the time of response
to an Ex. In each case, the
means show a clear trend from fewest
long solutions, and shortest response
times for the control group, to most
and longest for the severe stress
group, with the mild stress group in
between. The three F ratios are
highly significant. In a corollary
study, Cowen (12) contrasted a stress
group and a “praise” group. Significant
differences were obtained for
number of long solutions and for time
to solve an Ex. The difference in time
of solution of all problems was not
significant. Cowen concludes that
“less rigid behaviors were noted in
the ‘praise’ group, presumably as a
function of the anxiety-reducing effects
of E’s praise and reassurance”
(12, p. 427).

This conclusion deserves some fur- 
ther consideration in light of the pre- 
vious Cowen study (13). In that 
study, a neutral control group had a 
mean of 1.20 long solutions. The 
praise group in the second work had 
a mean of 3.16. The stress group of 
the second study (apparently the
same group listed under “severe
stress” in the earlier report) had a
mean of 5.12. Evidently, Cowen’s
conclusion is not borne out by the
data which show clearly that both 
stress and praise succeeded in in- 
creasing the proportion of long solu- 
tions. The same inference may be 
drawn from the data on time of re- 
sponse. The mean time for all prob- 
lems for the praise group is 33.60 sec- 
onds. For the neutral control, the 
mean is 21.28 seconds, and only 30.36 
seconds for the mild stress group! 
The mean time of solution for the Ex 
is 75.20 seconds for the praised Ss, 
24.72 seconds for the neutral control, 
and only 62.64 seconds for the mild 
stress group. An interpretation of 
this analysis is not immediately ap- 
parent. Perhaps praising an S for 
performance on a projective test and 
expressing interest in his further per- 
formance for correlational purposes 
(Cowen’s “praise” technique) actually
places the S in a stressful situation.
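
The comparison drawn in this paragraph can be laid out compactly. The sketch below simply tabulates the means quoted above and expresses each praise-group mean as a multiple of the neutral control mean; the dictionary layout and function name are illustrative, and the comparison runs across two separate reports, exactly as in the text.

```python
# Means quoted in the text. Control and mild-stress figures come from
# Cowen (13); praise figures come from Cowen (12).
means = {
    "long solutions": {"control": 1.20, "praise": 3.16},
    "time, all problems (sec.)": {"control": 21.28, "mild stress": 30.36, "praise": 33.60},
    "time, Ex (sec.)": {"control": 24.72, "mild stress": 62.64, "praise": 75.20},
}


def praise_to_control_ratio(measure: str) -> float:
    """Praise-group mean as a multiple of the neutral control mean."""
    row = means[measure]
    return row["praise"] / row["control"]


for measure in means:
    print(f"{measure}: praise / control = {praise_to_control_ratio(measure):.2f}")
```

On every measure the praise group exceeds the neutral control, and on both time measures it also exceeds the mild stress group.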

The study of Sivers (51) furnishes 
an interesting comparison with the 
five experiments discussed thus far 
in this section. Sivers approached 
the question from another angle. He 
distinguished a rigid and a nonrigid 
group on the basis of WJT scores and 
then selected a subsample of 44 Ss 
from each group, matched on the
basis of scores on Form A of the Abstract
Reasoning Test. Half of the
Ss in each group were subjected to 
stress, after which all Ss took Form 
B of the ART. An analysis of vari- 
ance of the difference scores between 
the two forms showed a highly sig- 
nificant variance due to stress, but 
an insignificant variance for rigidity, 
and no interaction. In other words, 
stress interfered with performance on 
Form B, but the effects were uncom- 
plicated by rigidity. The abstrac- 
tion ability of the rigid Ss was no 
more impaired by stress than that of 
the nonrigid Ss. The implications of 
the Sivers study relative to the find- 
ings of Christie, Harris, Pally, and 
Cowen, will be discussed in the next 
section. 

The investigations of Brown (8), 
Applezweig (4) and French (20), 
though primarily correlational stud- 
ies, also furnish comparisons of de- 
scriptive data under stress and non- 
stress conditions. Differences be- 
tween mean WJT scores under stress 
and nonstress are not significant in 
the Brown and French experiments. 
Applezweig reports mean scores for
three small groups of Ss to whom the 
WJT was administered at different 
times. One group took the test a day 
after experiencing stress, another a 
week after, and the third without 
stress at all. The week-after group 
had the smallest number of short 
solutions and the unstressed group 
had the largest. The critical ratio of 
this difference is given as 2.04 with a 
p beyond the .05 level. 

Applezweig also reports a signifi- 
cant variance ratio for the scores of 
the two groups. As in the case of 
Christie’s data, the ordinary table of 
probability cannot be used. The 
Cochran-Cox adjustment raises the 
CR required for the .05 level of significance
to 2.06, so that Applezweig’s
CR actually falls short. Furthermore,
as is often the case with WJT data, 
Applezweig’s distributions are mark- 
edly skewed (Cf. 3). If we apply a 
median test to the data of the un- 
stressed and week-after-stress groups, 
a chi square of only 0.84, p=.26, is 
obtained. 

If the results of the Brown, French, 
and Applezweig studies are averaged, 
the over-all mean score for the stress 
group is 2.75 short solutions, and 3.00 
short solutions for the nonstress 
group. This difference is not likely 
to be significant. Certainly it is far 
smaller than those to be found in the 
Cowen studies. We may reasonably
conclude that the descriptive data
of the correlational studies do not 
seem to support the conclusion that 
stress is accompanied by an increase 
in long WJT solutions. 


STRESS AND THE VALIDITY OF THE 
WJT AS A MEASURE
OF RIGIDITY 


Some of the experimenters who 
have used the WJT in stress studies 
carefully phrase their results in terms 
of ‘‘problem-solving rigidity’’ or some 
similar expression. The inference, 
whether expressed more or less overtly 
or allowed to remain implicit, is that 
rigidity is a function of the situation 
rather than of the personality. None- 
theless, in discussion sections subse- 
quent to experimental results, these 
same experimenters will make gen- 
eralizations from situational to per- 
sonality rigidity. Problem-solving 
rigidity is viewed as a “paradigm of
maladaptive behavior” (13, p. 518),
or it is regarded as “the same as that
observed clinically and reported in
studies dealing with a variety of
pathological states” (46, p. 352).
Because of such statements, the studies
of the WJT under stress were
considered by the present writer to
be actual efforts to demonstrate the
validity of the WJT as a rigidity 
measure. 

In summing up the various studies 
of performance on the WJT under 
stressful conditions (4, 8, 10, 12, 13, 
20, 24, 46, 51), we could hardly say 
more than that the over-all picture 
is beclouded. The evidence certainly 
will not support the unqualified con- 
clusion that there is a greater degree 
of WJT “rigidity” manifested under
stress than under nonstress circum- 
stances. But let us assume, for pur- 
poses of discussion, that this conclu- 
sion is warranted. How does it bear 
on the validity of the WJT as a meas- 
ure of rigidity? 

To begin with, the WJT is a learn- 
ing paradigm, similar in many re- 
spects to other tasks used in learning 
experiments. Its singular character- 
istic is that one of the two competing 
responses is made dominant, but the 
weaker response is the “correct” one. 
Studies of the effects of stress on this 
type of learning have been carried out 
by Castaneda and Palermo (9), Far- 
ber and Spence (18), and Montague 
(44) among others. The findings are 
summarized in the following quota- 
tion: 


If the habit strength of the correct response 
should be relatively weak, an increase in drive 
should further increase the strength of the in- 
correct tendencies relative to the correct 
tendency, resulting in impaired performance. 
Furthermore, the degree of impairment should 
be a positive function of the number and 
strength of the competing incorrect response 
tendencies (18, p. 120). 


A translation of the Hullian dialect 
into water-jar terms leads to this 
statement: ‘‘The administration of 
set problems prior to the Cr’s or Ex’s 
makes the long (incorrect) solution 
the dominant one. When the S is 
placed in stress, the frequency of 
dominant responses to the Cr's or
Ex's increases. Furthermore, it increases
as a function of the number
of set problems which were admin- 
istered (i.e., as a function of the 
strength of the incorrect tendency).” 
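
One simplified way to see why heightened drive should favor the dominant response is the multiplicative rule commonly associated with this theoretical tradition, in which the strength of a response is the product of drive and habit strength. The sketch below assumes that rule and uses made-up habit strengths and drive levels; it illustrates the form of the argument only and is not fitted to any WJT data.

```python
def response_strength(drive: float, habit: float) -> float:
    """Simplified multiplicative rule: strength = drive x habit strength."""
    return drive * habit


def dominance_gap(drive: float, habit_long: float = 0.8, habit_short: float = 0.3) -> float:
    """Margin by which the set-induced long solution outweighs the short one."""
    return response_strength(drive, habit_long) - response_strength(drive, habit_short)


# Hypothetical drive levels, from nonstress to stress.
for drive in (1.0, 1.5, 2.0):
    print(drive, round(dominance_gap(drive), 2))

# More set problems are assumed to mean a stronger long-solution habit.
print(round(dominance_gap(2.0, habit_long=0.9), 2))
```

Under the assumed rule the margin favoring the long solution widens both as drive rises and as the long-solution habit is strengthened, which is the pattern the translated statement describes.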

The hypothesis that stress is ac- 
companied by an increase in long 
solutions is thus one which comes 
out of learning theory, and has been 
demonstrated by learning experi- 
ments. It fits the data of tasks like 
learning paired associates, discrimi- 
nating colored lights, or pulling levers 
as well as it does the results with the 
WJT. The findings of Cowen, 
Christie, etc., thus have no particu- 
lar bearing on the validity of the 
WJT as a rigidity measure unless one 
is willing to accept any simple learning 
task of a certain type as a rigidity 
measure. This inclusion would surely 
be unacceptable to those who regard 
the WJT as a personality index. 

Sivers (51) provides admirable 
support for this stand. He showed 
that rigid and nonrigid Ss on the 
WJT manifest similar impairment in 
performance on another task when 
stress is introduced. If the WJT were 
measuring rigidity, we would expect 
that the ‘‘rigid’’ Ss would be more 
affected by stress. In other words, 
while the WJT functions adequately 
as a learning task, it fails as a diag- 
nostic instrument. 


THE WJT AS A PSYCHOMETRIC
INSTRUMENT 


Despite the widespread use of the 
WJT by psychologists, especially in 
doctoral dissertations, there has been 
only casual concern with its defects
as a psychometric tool. There appear
to be three such defects, any
one of which would be likely to be 
regarded as serious by formal test 
constructors. Two of these (loss of
subjects due to criteria for accepting
an experimental protocol, and the
skewness of distributions of scores)
were pointed out several years ago
by Levitt and Zelen (31). The third,
unamenability of the test to esti- 
mates of reliability, has been practi- 
cally ignored by experimenters. Each 
of these points warrants some ex- 
tended discussion. 

Reliability. Sivers (51) is one of 
the few experimenters who has been 
concerned with the reliability of the 
WJT. He sums up the matter con- 
cisely, thus: 

The reliability of the water jar test as a 
measuring instrument is difficult to establish 
directly. Most of the commonly used tech- 
niques do not suffice, for in the course of taking 
the test problem series many subjects discover 
that they have not always availed themselves 
of the direct method. Once a subject is con- 
sciously aware of what he might have done on 
previous problems, and if he has used the 
indirect method when the direct method could 
have been employed, he comes alert to further 
possibilities of this kind. For this reason, a 
test-retest situation is inappropriate. A split- 
half technique is obviously not to be considered
inasmuch as test items cannot be
equated (51, pp. 52-53).


In short, performance on a subse- 
quent test will be likely to be af- 
fected by performance on the original 
test for many Ss. Test-retest and 
equivalent forms are thus out of the 
question. Bakan (5), whose four 
Ex’s each required a different kind 
of solution, actually attempted to 
construct equivalent forms.⁹ The
correlation between forms was .42.
However, Bakan’s test was an atypical
type, not used by any other experimenter,
and it is not certain that
the reliability which she obtained can
be generalized to include other forms. 

The assumptions necessary for the 
computation of statistical estimates 
of reliability are obviously not satisfied
by a test in which the performance
on any one item is apt to be affected
by performance on previous
items. In fact, there does not seem
to be any sound way in which the reliability
of the WJT could be estimated.
The conscientious experimenter
who makes use of the WJT
must find some method of rationalizing
away his inability to estimate its
reliability.

9 The forms are not literally equivalent
since the means differ significantly.

Loss of subjects. One of the unusual 
aspects of the WJT as a psychometric
tool is that its use seems invariably to 
lead to a greater loss of Ss from the 
experimental sample than is custom- 
ary in psychological research. The 
loss is a result of various standards 
of performance required by the ex- 
perimenter on the preliminary prob- 
lems in the test. The S who fails to 
perform in the requisite manner is 
eliminated from the final, crucial 
phase of the testing. Unfortunately, 
almost half of the studies do not re- 
port either the standards or the sub- 
ject loss, although it is probable that 
the standards were applied in most, 
or all, cases. A few studies note the 
standards, but not the loss. In sev- 
eral instances, multiple standards 
were used, and only a pooled loss is 
recorded. Of 34 studies of adult sam- 
ples, only 16 report both standards 
and loss, while 15 report neither. 

One criterion is a sine qua non (in
its most literal sense) of any WJT
investigation: the solution of a requisite
number of set problems by the
long method. It is an obvious and
well-demonstrated fact that performance
on subsequent problems
will be a function of the number of 
set problems solved by the long 
method. Hence the experimenter 
must necessarily see to it that all Ss 
advancing to the crucial stage have 
solved the same, or approximately 
the same number of sets by the long 
method. Despite the evident im- 
portance of this standard, only 12 
studies report its application, and
two of these do not give the consequent
subject loss. In two of the remaining
10, a pooled loss due to multiple
criteria is given.

In the remaining eight studies, 221 
of 885 Ss, a total of 24.97 per cent, 
have been lost due to this criterion. 
This amounts to 9.27 per cent of the 
total sample of 2,385 Ss in the studies 
reporting both standards and loss.
These data, as well as the losses due
to other criteria, are shown in Table
1.


In the Rokeach and Cowen forms
of the WJT, a short solution (or a
long solution) of a control problem
is also used as a standard. In the
seven studies in which the loss due
to this criterion can be assessed, 181
of 886 Ss had to be discarded for failure
on the control problem. This
amounts to 20.43 per cent, or 7.59
per cent of the Ss in all studies giving
both criteria and loss.

Occasionally a criterion of “arithmetic
accuracy” is applied. It is
most often not clear just what the
experimenter means by this expression;
in some instances, the loss may
be due to simple inability to add and
subtract. In others, this standard is
probably the same as the requirement
of long solution of the sets
(naturally, the long solution cannot
be properly used unless the arithmetic
computations are accurate).
One hundred and nineteen Ss, or
22.54 per cent of 528 Ss, were lost in
two studies due to this criterion. This
is 4.99 per cent of the over-all sample.

In four studies where the loss due
to multiple criteria is given as one
figure, or where there is an unexplained
loss, a total of 113 Ss of 455
(24.84 per cent) were lost.¹⁰ This
amounts to 4.74 per cent of the whole
sample.

10 An indeterminate number of these 113 Ss
were lost due to absences from testing sessions,
or failure to volunteer to continue with
the experiment. These losses, of course, cannot
be attributed to the properties of the
WJT. In all probability, the number of Ss
thus lost is quite small, and a correction for
the loss would change the over-all loss only
slightly.

Over all, 634 Ss, or 26.58 per cent 
of the 2,385 Ss who were originally 
tested were eliminated from the final 
phase of the experiments. The percentage
tends to be much higher
when younger Ss are tested. One 
hundred and sixty-three of 286 child 
Ss in the studies of Rokeach (47) and 
Cowen and Thompson (15) had to 
be eliminated, a loss of 57 per cent of 
the sample. 
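
The attrition figures given in the preceding paragraphs can be checked with a short tally. The counts below are those stated in the text; the criterion labels, the dictionary layout, and the helper name are illustrative only.

```python
TOTAL_TESTED = 2385  # Ss in the adult studies reporting both standards and loss

# criterion: (Ss lost, Ss in the studies where that loss can be assessed)
losses = {
    "long solution of set problems": (221, 885),
    "control problem": (181, 886),
    "arithmetic accuracy": (119, 528),
    "pooled or unexplained": (113, 455),
}


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to two decimals."""
    return round(100 * part / whole, 2)


for criterion, (lost, assessed) in losses.items():
    print(f"{criterion}: {pct(lost, assessed)} per cent of {assessed} Ss, "
          f"{pct(lost, TOTAL_TESTED)} per cent of {TOTAL_TESTED} Ss")

total_lost = sum(lost for lost, _ in losses.values())
print("over all:", total_lost, "Ss,", pct(total_lost, TOTAL_TESTED), "per cent")
print("child samples:", pct(163, 286), "per cent")
```

The four criterion losses sum to the 634 Ss reported, and each percentage in the text is reproduced from the stated counts.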

Nor can the loss be attributed to 
group administration of the test. In 
10 studies using such administration, 
25.39 per cent of the Ss were lost, 
while in four reports of individual 
test administration, 32.57 per cent 
were lost. The loss was 22.66 per 
cent in two studies which did not 
note the type of administration. 

The over-all loss of over 25 per cent 
in the adult studies is a sizable attri- 
tion, and might very well result in a 
sampling bias. And losing one out 
of every four Ss halfway through an
experiment is hardly economical of 
time and research populations. 

The distribution of WJT scores. A
test constructor ordinarily strives to
develop an instrument which will
provide a normal, or nearly normal
distribution of scores. His aim may
be linked to theoretical considerations,
but more importantly, a normal
distribution enables the experimenter
to apply parametric statistics,
the most powerful available, in
analyzing results. Regardless of the
test, the individual experimenter
rarely obtains true normality since
most samples are relatively small.
However, if the curve of the distribution
is symmetrical about a maximum
ordinate at the mean, or even
if it is asymmetrical but not markedly
skewed, the experimenter will
usually assume normality in the parent
population so that he can resort
to a parametric analysis. (The exact
nature of the population distribution
is seldom known.) But when the obtained
distribution is multimodal,
J shaped, or plainly skewed in some
fashion, the assumption of underlying
normality becomes untenable,
and parametric statistics are inappropriate.
An instrument which regularly
leads to skewed distributions
of scores in individual experiments is
hence undesirable experimentally.

FIG. 1. DISTRIBUTIONS OF WATER-JAR TEST
SCORES SHOWING PERCENTAGES OF Ss OBTAINING
THE HIGHEST, LOWEST, AND MIDDLE
SCORE OF THE RANGE OF POSSIBLE SCORES.
THE Cr-CrEx CURVE IS BASED ON 442 Ss
IN SIX STUDIES. THE Ex CURVE IS BASED ON
166 Ss IN TWO STUDIES.

In an analysis of the data of four 
studies, Levitt and Zelen (31) called 
attention to the fact that the WJT 
furnishes an inordinately large num- 
ber of zero scores. They suggested 
that distributions involved were prob- 
ably badly skewed. In the studies in- 
volved in the present review, six 
others (8, 16, 20, 21, 24, 30) speci- 
fically mention obtaining skewed dis- 
tributions. Brown (8) and French 
(20) note that variance data are not
presented because of skewness. Maltz-
man et al. (42) and Applezweig (4)
appear to be cognizant of maldistri- 
bution in their data by the use of tau, 
a nonparametric correlation coeffi- 
cient. Harris (24) reported that he 
converted his time measures into 
logarithms since the distribution of 
raw data was skewed. 

Nine studies (2, 3, 10, 24, 25, 27, 
31, 32, 39) either present the distribu- 
tions of scores, or give sufficient in- 
formation from which the shape of 
the distribution may be inferred. Of 
these, the distributions of time meas- 
ures in the works of Harris (24) and 
Christie (10) are both clearly non- 
normal. The range of possible scores 
in the studies using Cr, CrEx or Ex 
measures varies from 0-4 to 0-7, so 
that direct comparability is not sim- 
ply accomplished. A reasonably clear 
picture of the nature of combined 
distributions may be obtained by 
plotting the percentages of Ss at- 
taining the highest possible score, the 
lowest possible, and the score at the 
midpoint of the range. Figure 1 
shows such a composite curve for the 
Cr and CrEx data from six researches 
(2, 3, 25, 27, 31, 39), and a similar 
curve of Ex scores from two studies 
(32, 39).¹¹

11 In some instances, the experimenter re-
ported the combined frequencies for the two
extreme scores at either end of the range, but
did not separate them. The present writer
divided the frequency evenly between the two
scores in those cases. In view of the over-all
data, this procedure probably tends to mini-
mize the frequency of Ss at the extremes.

In the former group, 31.8 per cent
of 442 Ss received the lowest possible
score, which is zero in all instances.
An additional 20.6 per cent attained
the highest possible score, while only
7 per cent fall at the midpoints. More
than 50 per cent of all the Ss mani-
fest “all-or-nothing” rigidity. The
Ex curve is similar with 31.3 per cent
of 166 Ss having a zero score, 37.3
per cent attaining the highest possi-
ble score, and 13.9 per cent at the
midpoints. Here, more than two- 
thirds of the Ss fall at the extremes. 
The difference between the percent- 
ages falling at the extremes in the 
Cr-CrEx studies and the Ex studies 
is probably a function of the more 
limited ranges of possible scores in 
the latter group. At best, we must 
conclude that fully half of all scores 
in WJT studies are likely to be found 
at the ends of the distribution of pos- 
sible scores. Distributions will thus 
tend to be markedly U shaped, defi- 
nitely nonnormal. 
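
The U shape described above follows from simple addition of the quoted percentages. A minimal Python sketch, using only the figures reported for the two composite curves (the dictionary layout and the function names are assumptions made for illustration):

```python
# Percentages of Ss at the lowest, middle, and highest possible score,
# as quoted in the text for the two composite curves of Fig. 1.
CURVES = {
    "Cr-CrEx": {"n": 442, "lowest": 31.8, "middle": 7.0, "highest": 20.6},
    "Ex": {"n": 166, "lowest": 31.3, "middle": 13.9, "highest": 37.3},
}

def share_at_extremes(curve):
    """Percentage of Ss at either end of the range of possible scores."""
    return round(curve["lowest"] + curve["highest"], 1)

def is_u_shaped(curve):
    """True when both end scores are more frequent than the midpoint score."""
    return (curve["lowest"] > curve["middle"]
            and curve["highest"] > curve["middle"])

for name, curve in CURVES.items():
    print(name, share_at_extremes(curve), is_u_shaped(curve))
```

On these figures the two extreme scores together account for 52.4 per cent of the Cr-CrEx sample and 68.6 per cent of the Ex sample, in line with the statements that more than half and more than two-thirds of the Ss fall at the extremes.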

To summarize this section, the 
available evidence indicates clearly 
that the WJT is deficient as a psy- 
chometric tool in three important re- 
spects: (a) no reliability coefficient 
can be estimated for it, (b) about 25
per cent of Ss originally sampled
must be discarded along the way, and
(c) it leads to skewed distributions of 
scores, precluding the use of para- 
metric statistical analyses. 


METHODOLOGICAL SHORTCOMINGS 
OF THE WJT STUDIES


The WJT seems to be so poor an
instrument that an incautious ex- 
perimenter will be easily led to com- 
mit errors of design and analysis. 
Researches with the WJT are rife 
with such flaws, some of which have 
been mentioned in previous sections 
of this paper. The most common of 
these is the use of parametric analy- 
ses, which are usually inappropriate, 
as the discussion of the last section 
shows. Statistics like the t test, r, and
biserial r were used in 23 studies.¹²
Other errors include the use of Ss
who failed to solve the specified number
of set problems in the crucial
phase, and failure to increase small
N’s to compensate for losses when
individual test administration was
used.

12 Use of the F ratio with nonnormal distri-
butions is not always inappropriate in view of
evidence (33) that the distribution of F is in-
sensitive to the shape of the parent distribu-
tions of variates involved.

There were a number of other
shortcomings which are not directly
attributable to the test itself. Among
these were the use of chi square with
nonindependent frequencies or with
very small theoretical frequencies,
failure to correct small chi-square
frequencies for continuity, inappro-
priate use of one-tailed tests of sig-
nificance, failure to use the over-all
F test when more than two groups
were involved, incorrectly stated
probability levels, failure to adjust
the probability level when variances
were heterogeneous, and inadequate
explanations of the arrangement of
data for statistical analysis.


That the evidence reviewed here
fails to demonstrate the validity of
the WJT as a rigidity measure ap-
pears to be an unchallengeable con-
clusion. However, many of the studies
are methodologically poor, so that it 
is possible to argue that the WJT has 
not yet been subjected to sound in- 
vestigation, and that any conclusion 
should hence be held in abeyance. 
The adoption or rejection of this 
stand is left to the reader’s discre- 
tion. 


SUMMARY AND CONCLUSIONS 


Thirty-one correlational studies
involving the water-jar Einstellung 
test and criterion measures were re- 
viewed. Although there are various 
forms of the test, and various meas- 
ures derived from it, no one was pre- 
dictively superior to the others. Stu- 
dies using the extinction problem as 
a measure of rigidity obtained no 
better results than those using the 
critical problem, or combinations of 
problems. Only five studies of the 
31 report positive results. About 75
per cent of over 200 individual cor-
relations are not significant; the
average of 18 correlations between 
the WJT and the California E and F 
scales is .07. Brown's hypothesis 
that the relationship between au- 
thoritarianism and rigidity is a func- 
tion of stress or ego-involving condi- 
tions is not borne out by the data. 
The average of seven correlations 
computed from data obtained under 
stress is only .05. 

Analysis of the relationships be- 
tween the WJT and the Rorschach, 
measures of emotional adjustment, 
concept formation, reasoning, and 
perceptual and motor tasks indicates 
that no individual index has a clear 
or consistent association with the 
WJT. An analysis of nine studies of 
intelligence and the WJT leads to the
tentative conclusion that there is a 
consistent, low negative relationship 
between the WJT and intelligence. 
This conclusion is supported by two 
factor-analysis studies, both of which
place the WJT in factors heavily 
loaded with intelligence, though not 
in factors termed rigidity. 

Five noncorrelational studies of the 
WJT in experimental stress condi- 
tions are reviewed and criticized. The 
results of at least three must be re- 
garded as ambiguous, while those of 
Cowen (12, 13) indicate that both 
stress and praise may increase “ri-
gidity” on the WJT. The correla-
tional studies of the WJT and stress
in which comparative descriptive
data are presented, do not support
the conclusion that WJT “rigidity” 
increases under stress. The study of 
Sivers (51) suggests that though test 
performance may be impaired by 
stress, the impairment is unrelated
to scores on the WJT. A theoretical 
explanation of the effect of stress on 
WJT scores is offered. This explana- 
tion derives from learning experi- 
ments and learning theory, and at- 
tempts to show that increases in WJT 
“rigidity” under stress have no bear-
ing on the validity of the WJT as a
rigidity index. 

Three deficiencies of the WJT as a 
psychometric instrument are dis-
cussed. It is concluded that (a) a
reliability coefficient cannot be esti-
mated for the test, (b) about one of
every four Ss in an original sample 
will be eliminated from the crucial 
experimental phases due to various 
standards of performance required in 
preliminary stages, and (c) the WJT 
tends to produce nonnormal, usually 
U shaped distributions of scores. A 
number of methodological defects in 
the studies reviewed were also pointed 
out. 

The conclusions of this review can 
be summarized pithily in two state- 
ments: 

1. After eight years of research,
evidence for the validity of the water-
jar test as a measure of rigidity is
still lacking.

2. The water-jar test is a poor psy-
chological test qua test.


From the time of classical psycho-
physics, the phi-gamma hypothesis
has been widely accepted (7, 13, 15, 
18, 22, 39, 41). The hypothesis 
states that, in a psychophysical ex- 
periment, the relationship of the pro- 
portion of observer responses to stim- 
ulus values is accurately described by 
the integral of the normal probability 
curve (41). Recently, however, this 
hypothesis has been directly chal-
lenged by the neural quantum² theory
of sensory discrimination (37, 38, 45). 

Broadly stated, the theory of the 
neural quantum may be considered as 
an attempt to explain a paradox in 
modern sensory psychology. How 
can a continuous change in environ- 
mental energy give rise to a (seem- 
ingly) continuous change in sensory 
experience, when it is generally 
agreed that sensory mechanisms are 
composed of discrete neural elements 
which follow the all-or-none law of 
physiology? The paradox may be re-
solved in one of two ways: experi-
mental evidence must be obtained
which demonstrates either that (a)
sensory nerve action is continuous
or that (b) the (apparent) contin-
uum of sensory experience is discrete.
The classical phi-gamma hypothesis
assumes the first alternative, since
psychometric functions are typically
found to be smoothly sigmoidal in
form. The more recent quantal hy-
pothesis assumes the second alterna-
tive, since some psychometric func-
tions have been obtained which are
linear in form. In general terms, the
latter findings have been interpreted
as an indication that the change in
nervous activity which leads to a
discriminatory response proceeds in
a stepwise manner by definite incre-
ments or quanta and that these
quanta are directly reflected in the re-
sponse itself.

1 This work was supported by a research
grant NSF-G1285 from the National Science
Foundation.

2 The term “quantum,” as used in the
present paper, has a meaning entirely dif-
ferent from Planck's (20) quantum in physical
theory, Hecht, Shlaer, and Pirenne’s (23)
quantum in visual theory, and Gabor's (19)
quantum in auditory theory. In each of these
instances, the quantum refers to a unit of
physical energy; here it refers to a functionally
distinct unit in the neural mechanisms which
mediate sensory experience. Hence, “quan-
tum” in the present sense implies a perceptual,
rather than physical, unit.

Since the question of the best 
mathematical formula for represent- 
ing a psychometric function, together 
with its underlying implications, is of 
central importance in the area of psy-
chophysics, it would appear that
some detailed attention should be 
directed toward the newer theoretical 
developments. The purpose of the 
present paper, therefore, is to present 
a complete account of the theory of 
the neural quantum of sensory dis- 
crimination and to re-examine the 
theory in the light of experimental ev- 
idence accumulated since its incep- 
tion. 

EARLY NOTIONS OF 
SENSORY QUANTA 

In 1919, Titchener (42), in discuss-
ing the problems of measuring the
stimulus and differential limens,
stated that once the “nervous ma-
chine” was started and adequate
stimulation was continued, sensation
would follow continuously the changes
of stimulus. While not expressed in
quantal terms, the basic notion un-
derlying the discontinuities observed
at the stimulus and differential limens
was that every sense organ offered a
certain amount of “frictional resist-
ance” to the stimulus. Supposedly,
this resistance had to be overcome be-
fore a corresponding change in the
sensation would result.³

Boring (8) in 1926 attacked the 
problem of sensory experience more 
directly and pointed out that any 
theory based upon specific energies 
of nerves is a theory of sensory 
quanta. In hearing, according to this 
view, a new pitch is produced when 
the sound stimulus activates a dif- 
ferent neural element. Furthermore, 
the sensory continuum reduces to a 
finite number of small steps, or 
quanta, corresponding to the num-
ber of discrete responsive elements
in the given sense organ. This, es-
sentially, was the problem Helmholtz
(24) tried to solve years earlier when 
he compared the number of discrim- 
inable pitches with the number of 
rods in the organ of Corti. 

In the absence of experimental
data demonstrating the existence of
pitch quanta, Boring (8) rejected the
resonance theory of pitch and sup-
ported, instead, a frequency theory
in which pitch depended upon the fre-
quency of neural impulses and was,
therefore, nonquantal. Loudness,
however, was related to the number
of nerve fibers activated by the stim-
ulus, thereby making it quantal.

3 A somewhat similar view has been ex-
pressed more recently by Licklider (30, p.
1001) who contends that “in the simplest
conceptual neurology, the stimulus threshold
owes its existence to the effect of a small
barrier . . . between successive stages in the
neural processes that underlie hearing. . . . If
the DL (Difference Limen) is more than a
statistical artifact, the neural mechanism
must function in a stepwise or quantal
manner.”

While not considering sensory 
quanta directly, Troland (43) pointed 
out the tendency toward “quantum
theorizing” about the processes of the
nervous system. It was suggested
that “the all-or-none principle, as ap-
plied to nerve activity, forces us to
think of the latter in terms of fixed 
units of influence” (43, p. 37). 

In 1930, Békésy (1) presented the 
first experimental evidence which in- 
dicated that, with appropriate tech- 
niques, discrete sensory steps could 
be obtained, at least in the field of 
hearing. This was accomplished by 
presenting a standard tone of 0.3 sec. 
duration, followed immediately by a 
comparison tone of the same duration 
but of variable intensity. The ob- 
server reported whether or not he 
heard a difference between the two 
tones. The data of this study, plotted 
as percentage judged different against 
ΔI/I, yielded rectilinear functions
which were interpreted as indications 
of the quantal nature of differential 
sensitivity to intensity. Apparently, 
Békésy (1) was able to minimize suf-
ficiently the “extrinsic variability”⁴
in the experimental situation such
that the “true” mechanism of sen-
sory discrimination was finally re-
vealed.

In 1936, Békésy (2) obtained fur-
ther evidence on the quantal nature
of sensory functions. In this case, the
minimum audible pressures for pure
tones were determined from about 2
cycles per second (cps) to 50 cps by
alternately increasing frequency and
decreasing intensity. When the
audibility curve was plotted, steplike
discontinuities occurred at fairly reg-
ular intervals between 4 cps and 50
cps, with the most prominent step at
18 cps.

4 “Extrinsic variability” refers to the vari-
ability in factors outside the specific part of
the sensory nervous system critically involved
in making the required discriminations in a
given experiment, e.g., changes in criteria of
judgment, in attention, in motivation, etc.
(35).

NEURAL QUANTUM THEORY OF SENSORY DISCRIMINATION


THE THEORY OF THE
NEURAL QUANTUM 


The theory of the neural quantum 
in audition was made explicit by 
Stevens, Morgan, and Volkmann 
(38) and is derived from the assump- 
tion that the basic neural processes 
which mediate pitch and loudness dis- 
crimination operate on an all-or-none 
principle. These processes are as-
sumed to involve neural structures
which are divided into functionally
distinct units or quanta. A further
assumption of the theory states that
a stimulus-increment will be discrim-
inated whenever it excites one quan-
tum more than the number of quanta
excited by the standard stimulus at a
given moment.⁵

5 When this assumption is met, the observer
is said to have adopted a “one-quantum” cri-
terion of discrimination. Usually, however,
for reasons to be indicated in the development
of the theory, such a criterion is difficult to
establish and the observer will require that the
stimulus-increment excite two additional
quanta before a discrimination is reported.

FIG. 1. FORM OF THE PSYCHOMETRIC FUNCTION ON THE SINGLE ASSUMPTION OF
THE EXISTENCE OF NEURAL QUANTA
(Ordinate: response. Abscissa: magnitude of stimulus increment.)

On the basis of the assumption of
the existence of neural quanta, sen-
sory discrimination data would be
expected to yield a rectilinear psy-
chometric function. This would be
accomplished theoretically in the fol-
lowing manner. Suppose that a cer-
tain stimulus excites completely a
given number of quanta and that no
stimulus energy, or residual, exists
after this neural excitation has been
accomplished. Then let stimulus-
increments be added to the predeter-
mined stimulus under specified ex-
perimental conditions of pitch or
loudness discrimination. If the
neural units are stable and constant,
the increments will not excite the ad-
ditional quantum required for dis-
crimination until their magnitude
reaches a certain size. Thereafter,
each time that the increment is added
to the standard stimulus, a just no-
ticeable difference (j.n.d.) in pitch or
loudness should occur. If the data of
this theoretical model were presented
in the form of a psychometric func-
tion, as shown in Fig. 1, with percent-
age of response plotted against in-
cremental magnitude, 0 per cent re-
sponse would be obtained up to a
certain point on the stimulus scale 
and 100 per cent response would be 
obtained for all increments above 
this point. These expectancies, how- 
ever, are not evidenced in auditory dis- 
crimination data and, consequently, 
an additional assumption on thresh- 
old variability must be introduced. 
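
The all-or-none form predicted on the single assumption of stable quanta can be written as a step function. The sketch below is illustrative only; the function name and the choice of a quantum size of 1.0 arbitrary unit are assumptions, not values taken from the studies cited.

```python
def step_response(increment, quantum_size=1.0):
    """Per cent response when neural units are stable and no residual exists.

    Increments smaller than the quantum size are never discriminated;
    increments that reach or exceed it are always discriminated.
    """
    return 100.0 if increment >= quantum_size else 0.0

# Tabulate the predicted response for a few increment sizes.
for increment in (0.25, 0.5, 0.75, 1.0, 1.25):
    print(increment, step_response(increment))
```

A function of this kind jumps from 0 to 100 per cent at a single point on the stimulus scale, which is the perpendicular form that the auditory data fail to show.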

This assumption holds that the
over-all sensitivity of the human or-
ganism does not remain at a constant
level, but fluctuates momentarily and
randomly⁶ through magnitudes con-
siderably larger than a single quan-
tum. It follows, therefore, that the
amount of stimulus energy required
to activate a fixed number of neural
units will vary with fluctuations in
sensitivity. Conversely, a stimulus
of given magnitude will excite a vary-
ing number of neural units. Since the
variation in the number of activated
units is assumed to be quantal, or
stepwise discontinuous, all of the
available energy of a given stimulus
will not necessarily be utilized in a
given presentation. Thus, at a par-
ticular moment, the given stimulus
may excite completely a certain num-
ber of quanta and leave a small
amount of residual energy which
“partially” excites an additional
quantum.⁷ This residual, while in-
effective by itself to activate the
next quantum, becomes available for
summation with the energy provided
in the succeeding stimulus-increment
and may consequently produce a dis-
criminatory response.

6 This assumption appears in agreement
with the available data of Montgomery (33)
and Lifschitz (31) which indicate that the
sensitivity of the ear approximates a normal
distribution as it varies with time. These
fluctuations are presumably due to extraneous
factors, such as breathing movements, extra-
loud heart beats, lapses of attention, shifts in
motivation, etc.

7 This notion of “partial” excitation follows
the work of Stevens, Morgan, and Volkmann
(38). While such a notion is inconsistent with
a quantal function, it does aid in the concep-
tualization of the theoretical model and,
hence, will be retained in the present paper.
A restatement of the concept in more precise
stimulus terms would in no substantial way
alter the theory.

The basic notions of the theory of
the neural quantum which have been
introduced up to this point are sche-
matically represented in Fig. 2. Two
continua are shown: (a) a stimulus
continuum with an arbitrary scale,
and (b) a sensory continuum with dis-
crete neural units. On the stimulus
continuum, S is the magnitude of the
standard stimulus; ΔSq is the mag-
nitude of the stimulus-increment
which will always excite an additional
quantum; and ΔS is the amount of
energy (magnitude of the stimulus-
increment) which is required to ac-
tivate a “partially” excited neural
unit. On the sensory continuum, p
is the amount of “partial” excitation
resulting from the presentation of a
given S.

If Fig. 2 is taken to represent the 
condition of loudness discrimination, 
a stimulus magnitude of 17 energy 
units is considered sufficient to stim-
ulate completely the neural elements
a, b, and c. Neural element “d” is
only “partially” stimulated by the
residual energy beyond 15 units. As-
sume that such a situation would
yield a given loudness. If the stimu-
lus energy were reduced to 15 units,
there would be no apparent change
in loudness; but, if the energy were
reduced to 14 units, neural element
“c” would drop out and the loudness
would diminish by one j.n.d. Like- 
wise, if the energy were increased to 
19 units by introducing a 2-unit stim- 
ulus-increment, no change in loud- 
ness would result. At 20 units, how- 
ever, the loudness would increase by 
one j.n.d. 
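
The numerical example can be traced with integer arithmetic. The sketch below assumes a quantum size of 5 energy units, a value not stated explicitly in the text but implied by the example (three elements fully excited at 15 units, a fourth at 20); the function names are illustrative.

```python
QUANTUM_SIZE = 5  # energy units per neural element; inferred from the example, not stated in the text

def excitation(stimulus, quantum_size=QUANTUM_SIZE):
    """Return (completely excited elements, residual energy) for a stimulus."""
    return divmod(stimulus, quantum_size)

def loudness_change(standard, comparison):
    """Change in loudness, in j.n.d. steps, going from standard to comparison."""
    return excitation(comparison)[0] - excitation(standard)[0]

# Stimulus magnitudes discussed in the text, compared with the 17-unit standard.
for energy in (17, 15, 14, 19, 20):
    units, residual = excitation(energy)
    print(energy, units, residual, loudness_change(17, energy))
```

Under this assumption the 17-unit standard excites three elements with a residual of 2 units; 15 and 19 units leave the count unchanged, 14 units removes one element, and 20 units adds one, as the text describes.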

Two features of the diagram in Fig.
2 should be noted specifically: (a) the
size of the neural quantum is meas-
ured in terms of ΔSq, and (b) a “par-
tially” excited unit can be stimulated
by adding to S an increment (ΔS)
smaller than the amount (ΔSq) re-
quired for stimulation when no “par-
tial” excitation (p) exists. Fluctua-
tions in sensitivity can, therefore,
bring about discriminatory responses
to an increment smaller than one of
neural unit size. This would account
for the absence of the perpendicular
psychometric function (see Fig. 1)
predicted on the single assumption of
the existence of neural units.

FIG. 2. A SCHEMATIC REPRESENTATION OF THE BASIC NOTIONS INVOLVED
IN THE THEORY OF THE NEURAL QUANTUM

As already indicated, a stimulus-
increment (ΔS) smaller than one of
neural unit size (ΔSq) will excite an
additional quantum only when the
residual energy (p) is sufficiently aug-
mented by the increment to provide
a total supply of energy equal to or
greater than that required for the ac-
tivation of a “complete” neural unit.
Obviously, when the residual is large,
the increment required is small; when
the residual is small, the increment
required is large. Thus, at any in-
stant, the magnitude of the stimulus-
increment necessary to add another
quantum to the total number excited
by the standard stimulus depends
upon the amount of residual energy
or “partial” excitation. Stated in
mathematical form:

$$\Delta S = \Delta S_q - p, \tag{1}$$

where ΔS is the stimulus-increment
required to activate an additional
quantum; ΔSq is the size of the incre-
ment which will always excite one
quantum; and p is the amount of
“partial” excitation elicited by the
surplus energy in the standard stim-
ulus (S).

Equation 1 indicates that a given
ΔS will completely stimulate the ad-
ditional quantum needed for a dis-
crimination whenever ΔS ≥ ΔSq − p.
As the size of the increment becomes
greater, an increase in the number of
discriminations is to be expected. 
The precise manner in which the 
number of discriminations increases 
as a function of increment size de- 
pends upon the relative frequency 
with which the different surplus (re- 
sidual) values occur. 
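
Equation 1 and the accompanying inequality can be stated as two small functions. This is a sketch only; the names are hypothetical, and the illustrative values (a quantum of 5 units and a residual of 2 units) are carried over from the example of Fig. 2 on the assumption that each neural element there corresponds to 5 energy units.

```python
def required_increment(quantum_size, residual):
    """Equation 1: the increment needed to excite one additional quantum."""
    return quantum_size - residual

def is_discriminated(increment, quantum_size, residual):
    """True when the increment is at least quantum_size minus the residual."""
    return increment >= required_increment(quantum_size, residual)

print(required_increment(5, 2))   # increment needed when the residual is 2 units
print(is_discriminated(2, 5, 2))  # a 2-unit increment added to that residual
print(is_discriminated(3, 5, 2))  # a 3-unit increment added to that residual
```

With no residual the required increment equals the full quantum size, and it shrinks as the residual grows, which is the inverse relation between residual and increment described above.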

The relative frequencies of occur- 
rence can be arrived at by the fol- 
lowing logical analysis. Assume, as 
before, that the over-all fluctuation 
in the sensitivity of the organism is 
large as compared to the size of a 
neural quantum. This fluctuation 
will produce a variation in the num- 
ber of neural units excited completely 
by the standard stimulus. Since the
amount of surplus or residual energy
cannot exceed neural unit size, it
must always stimulate “partially”
the first unit beyond the last one 
stimulated by the standard stimulus. 
The surpluses will be spread out, 
therefore, over the same range of 
neural units stimulated during the 
course of fluctuations in sensitivity. 
The relative frequency with which 
the surpluses are distributed over this 
range will be dependent upon the 
time distribution of the organism's 
sensitivity. Assuming, as before,
that the organism fluctuates in sensi- 
tivity because of a large number of 
unknown, independent factors, the 
distribution of surpluses over the 
range described will approximate a 
normal curve. 

This “chance” distribution of sur-
pluses is shown in Fig. 3. The ab-
scissa has been arbitrarily divided
into six equal neural units which rep-
resent the range over which the or-
ganism fluctuates in sensitivity.
Each neural unit has also been sub-
divided, arbitrarily, into ten
equal surplus values. Thus, a sur-
plus of a given magnitude may be
found to occur in each of the six
neural units. The ordinate of the
distribution represents the theoreti-
cal relative frequency of occurrence
of the surplus values. For example,
let the surplus value equal 0.3 of a
neural unit. The vertical lines drawn
in the distribution will indicate the
relative frequency of occurrence of
this surplus value within each of the
six neural units. Notice that this rel-
ative frequency is not the same from
unit to unit.

The probability function of the
surplus values can now be deter-
mined. This is accomplished by sum-
mating over the several neural units
covered by the normal distribution of
surpluses resulting from the organ-
ism’s fluctuations in sensitivity, the
relative frequencies of occurrence for
each possible surplus value from zero
to neural unit size. From the ob-
tained distribution, the probability
of a given surplus value may be de-
termined.

A graphical derivation of the prob- 
ability function of surplus values can 
be demonstrated by utilizing the 
representation in Fig. 3. Accordingly, 
Fig. 4 shows the function obtained by 
summating, over the six neural units, 
the relative frequencies of occurrence 
of the individual surplus values rang- 
ing from zero to one in 0.1 neural unit 
steps. For example, the segmented
vertical line in the diagram of Fig. 4
represents the summation for the
surplus value equal to 0.3 of a neural
unit. Each segment of the line, start-
ing at the bottom, corresponds in
length to the appropriate ordinate
shown in Fig. 3. The same procedure
has been followed to obtain the sum-
mation value for each of the nine re-
maining surpluses. Since the form of
the obtained distribution is approxi-
mately rectangular, and the neural
unit is divided into ten equal parts,
the probability for each surplus is
the same. Thus, it may be said that,
given a standard stimulus, any sur-
plus value is as likely to occur as any
other. A similar conclusion may be
arrived at mathematically by Bayes’
(16) theorem, which states that the
distribution of the probability inte- 
grals of any continuous curve is a 
rectangle with every probability be- 
tween zero and one equally likely. 
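
The graphical summation of Figs. 3 and 4 can be approximated numerically. The following sketch assumes, for illustration only, that the six neural units span three standard deviations on either side of the mean of a unit normal curve, so that each neural unit is one standard deviation wide; the original figures do not state this scaling. For each of the ten surplus values, the normal ordinates at the corresponding point in each of the six units are summed.

```python
import math

N_UNITS = 6        # neural units covered by the fluctuation in sensitivity
LOWER_EDGE = -3.0  # assumed position of the first unit, in standard deviations

def normal_ordinate(z):
    """Ordinate of the unit normal curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def summed_frequency(surplus):
    """Sum the ordinates for one surplus value over all six neural units."""
    return sum(normal_ordinate(LOWER_EDGE + unit + surplus)
               for unit in range(N_UNITS))

# Ten surplus values, from 0.0 to 0.9 of a neural unit in steps of 0.1.
surpluses = [step / 10 for step in range(10)]
totals = [summed_frequency(s) for s in surpluses]

for s, total in zip(surpluses, totals):
    print(f"surplus {s:.1f}: summed ordinate {total:.4f}")
```

Under these assumptions the ten sums differ from one another by less than one per cent, which is the approximately rectangular form described in the text. A narrower spread of sensitivity relative to the neural unit would be expected to give a less even result.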

On the basis of the preceding an- 
alysis, the form of the psychometric 
function may be predicted. It has 
already been shown that the number 
of responses to an increment is a 
function of the size of the increment; 
the greater the increment, the greater 
will be the number of responses. The
rate at which the number of responses
increases with the increase in the size 
of the increment can be determined 
from the frequency of occurrence of 
the surplus values. Since the proba- 
bility function is rectangular, one 
value of surplus occurs as frequently 
as any other. Therefore, for a given 
increase in the size of the increment,
the proportion of surpluses which can
be augmented to neural unit size, or 
greater, is always the same. The 
rate of increase in the number of re- 
sponses is, then, constant and the re- 
lationship between increment size 
and percentage of response is clearly 
linear. 
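
The linearity argument can be checked by enumeration. In the sketch below the neural unit is divided into ten equal parts, as in the text, and every surplus value is treated as equally likely; an increment produces a response whenever increment plus surplus reaches neural unit size. Working in tenths of a unit (integers) is a choice made here to avoid rounding error, and the names are illustrative.

```python
QUANTUM = 10                # size of one neural unit, expressed in tenths
SURPLUSES = range(QUANTUM)  # ten equally likely surplus values: 0, 1, ..., 9 tenths

def proportion_responding(increment):
    """Proportion of surplus values that the increment raises to unit size."""
    hits = sum(1 for surplus in SURPLUSES if increment + surplus >= QUANTUM)
    return hits / len(SURPLUSES)

# Increments from zero up to one full neural unit, in tenths.
for increment in range(QUANTUM + 1):
    print(increment / QUANTUM, proportion_responding(increment))
```

Each additional tenth of a unit in the increment brings exactly one more surplus value up to neural unit size, so the proportion of responses rises in equal steps from zero to one.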

Such a psychometric function may 
be graphically represented by a 
straight line, i.e., the integral of the 
rectangular probability distribution 
of surplus values. The zero point 
for this function should correspond to 
the value of the standard stimulus, 
since any increment, no matter how 
small, will find a surplus which it can 





JOHN F. CORSO 


augment to neural unit size, pro- 
vided the increment is presented a 
sufficient number of times. The 100 
per cent point should correspond to 
the smallest increment which always 
succeeds in exciting an additional 
neural unit. This increment, which 
must be independent of the surplus 
since it always produces a discrim- 
inatory response, provides a measure 
(ΔSq) of the size of the neural quan-
tum.

The foregoing statements of the
theory of the neural quantum can be
summarized in the form of mathe-
matical equations. Equation 1 has
already been formulated (ΔS = ΔSq
− p) and indicates that an additional
quantum will be activated whenever
the amount of energy in an increment
is sufficient to augment the surplus
energy to a neural unit amount. Since
the surplus (p) fluctuates over the range
0 < p < ΔSq and any value of p is as
likely to occur as any other, the pro-
portion of times that an increment
will activate an additional neural
unit is given by:

$$ f_1 = \frac{\Delta S}{\Delta S_q} \qquad [2] $$

where f₁ is the relative frequency of
the instants during which ΔS excites
one additional quantum; ΔS and ΔSq
have the same meanings as previously
given.

Two features of the relationship
expressed in Equation 2 should be
noted: (a) the proportion of re-
sponses increases as a linear function
of incremental size and (b) the value
of f₁ may vary from zero to one.
Equations 1 and 2, however, hold
only for those discrimination situa-
tions in which the excitation of a
single additional quantum is suffi-
cient to produce a response; but the
evidence of Békésy (1), Miller and
Garner (32), and Blackwell (4) shows
that usually two additional units
must be activated before a discrim-
inatory response is reported. This is
attributed to the fluctuations in sen- 
sitivity which occur during the pre- 
sentation of the standard stimulus. 
Since these fluctuations may produce 
surplus values of neural unit size, the 
subject finds it difficult to distinguish 
this excitation from that resulting 
from the presentation of an adequate 
increment. If, as indicative of an 
increment, the subject adopts a cer- 
tainty criterion which can be met 
only when two additional quanta are 
excited, he is then able to distinguish 
the effect of the surplus alone from 
the combined effect of increment and 
surplus. In this case, the proportion 
of times that a given increment will 
produce a discriminatory response 
may be expressed as follows: 

$$ f_2 = \frac{\Delta S}{\Delta S_q} - 1 \qquad [3] $$

where f₂ is the relative frequency of
occurrence of the instants during
which ΔS excites two additional
quanta. Observe that f₂ may also
vary between zero and one.
Equation 3 may be rewritten in
terms of the percentage (P) of the in-
crements to which an observer should
be able to make a discriminatory re-
sponse. In this form,

$$ P = \left( \frac{\Delta S}{\Delta S_q} - 1 \right) 100, \qquad [4] $$

and P may vary between 0 per cent 
and 100 per cent. 
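
Equations 2 and 4 may be expressed as a single small function. The Python sketch below is illustrative only. The parameter `quanta_required` and the clipping of the result to the range of 0 to 100 per cent are assumptions introduced here to make the stated limits of P explicit; they are not notation used in the paper.

```python
def predicted_percentage(increment, quantum, quanta_required=2):
    """Predicted per cent of increments discriminated.

    With quanta_required=1 this is Equation 2 expressed as a percentage;
    with quanta_required=2 it is Equation 4. Values outside 0..100 are
    clipped (an assumption made for the example), since P cannot fall
    below 0 per cent or exceed 100 per cent.
    """
    raw = (increment / quantum - (quanta_required - 1)) * 100.0
    return max(0.0, min(100.0, raw))


if __name__ == "__main__":
    # Increments expressed in quantal units (quantum = 1.0).
    for size in (0.5, 1.0, 1.5, 2.0, 2.5):
        print(size, predicted_percentage(size, 1.0))
```

With a two-quanta criterion the function departs from 0 per cent at one quantal unit and reaches 100 per cent at two quantal units.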

Referring to Equation 4, incre- 
ments less than quantal in size will 
never stimulate two additional quanta 
since the surplus cannot exceed one 
unit; hence, the combined energy of 
increment and surplus will be less 
than that required for two units and 
no discriminatory response will occur. 
With increments of quantal size or 
greater, discriminatory responses will 
occur and will increase in the same 
manner as described in the case of
the “one-quantum” criterion. One
hundred per cent response will occur
at the smallest increment value which
can always excite two additional
quantal units. This value, which is
independent of surplus, will be twice
the size of the largest increment to
which a response never occurs. This
prediction follows from the assump-
tion previously made that neural
units are equal. The largest incre-
ment to which a response “just
never” occurs is taken as a measure
of the size of the first quantal unit;
the smallest increment to which a re-
sponse always occurs is taken as the
size of two quantal units. Conse-
quently, on the assumption of equal
units, a two-to-one ratio obtains be-
tween the value at which the psy-
chometric function reaches 100 per
cent and the value at which it first
departs from 0 per cent.

If the experimental conditions and
underlying assumptions of the quan-
tum theory are satisfied, a typical
psychometric function such as that
for pitch or loudness discrimination
should resemble the function pre-
sented in Fig. 5. Two features of the
function should be observed: (a)
there is a linear relationship between
the percentage of increments heard
and the magnitude of stimulus-in-
crements presented, and (b) there is a
two-to-one ratio between the values
of the function at the 100 per cent
point (2 quanta) and the 0 per cent
point (one quantum). These features
of the psychometric function are the
two specific deductions of the neural
quantum theory of sensory discrimi-
nation which can be subjected to ex-
perimental verification.


[Figure: response plotted against size of increment in quantal units (abscissa marked 1, 2, 3).]
FIG. 5. FORM OF THE PSYCHOMETRIC FUNCTION PREDICTED BY THE THEORY OF THE NEURAL QUANTUM


QUANTAL PREDICTIONS AND
TECHNIQUES OF EVALUATION


The first major prediction of the
theory of the neural quantum is that
the percentages of stimulus incre-
ments discriminated will be distrib-
uted rectilinearly between 0 per cent
and 100 per cent. Stated symboli-
cally, P will be a linear function of
ΔS. The classical hypothesis, as pre-
viously indicated, would predict a
sigmoidal probability function for the
same set of data. The question be-
comes, then, “which of the two hy-
potheses, sigmoidal or quantal, better
fits these data points?” (35, p. 61).
To answer this question, the best-
fitting sigmoidal and rectilinear func-
tions must be constructed for the
given set of data.⁸ While any one of
several techniques may be employed,
the curve-fitting process is most ade-
quately accomplished by using either
the method of least squares or the
more recently developed technique
of probit analysis (21). When the
best-fitting sigmoidal and rectilinear
functions have been obtained, it will
usually be found that neither curve
gives a perfect fit, i.e., passes through
all the data points. Thus, an ap-
propriate statistical test of goodness
of fit, such as chi square (28), must
be applied to determine the proba-
bility that the experimental data
could have been obtained by chance
when the “true” function was either
rectilinear or sigmoidal in nature.
The results of this analysis will indi-
cate whether the specific theoretical
hypothesis being tested should be re-
jected or retained.

8 At least one investigator (11) has assumed
that threshold data may be fitted by a log-
Gaussian distribution, i.e., an ogive expressed
in terms of a logarithmic scale of stimulus
magnitude. Since the normal ogive and log-
Gaussian distribution are highly similar, no
special case will be made in the present paper
for this additional hypothesis.
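
The comparison of fits described in this section can be sketched in a few lines of Python. The example is a simplified illustration under stated assumptions and does not reproduce any investigator's procedure: both fits are unweighted least-squares fits (no Müller or Urban weights), the sigmoidal function is fitted to normal deviates rather than to proportions, the data are hypothetical, and only the chi-square statistic is computed. All function names are introduced here for the example.

```python
from statistics import NormalDist


def fit_line(x, y):
    """Unweighted least-squares line y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def chi_square(observed, expected, trials):
    """Pearson chi square over the 'heard' and 'not heard' counts.

    Each expected proportion must lie strictly between 0 and 1.
    """
    return sum(trials * (o - e) ** 2 / (e * (1.0 - e))
               for o, e in zip(observed, expected))


def compare_fits(increments, proportions, trials):
    """Return (chi square for the straight line, chi square for the ogive)."""
    # Rectilinear hypothesis: fit the proportions directly.
    a, b = fit_line(increments, proportions)
    linear_expected = [a + b * x for x in increments]

    # Sigmoidal hypothesis: fit a line to the normal deviates, then
    # transform back. inv_cdf requires proportions strictly inside (0, 1).
    normal = NormalDist()
    deviates = [normal.inv_cdf(p) for p in proportions]
    c, d = fit_line(increments, deviates)
    ogive_expected = [normal.cdf(c + d * x) for x in increments]

    return (chi_square(proportions, linear_expected, trials),
            chi_square(proportions, ogive_expected, trials))


# Hypothetical data: four increments, 100 judgments at each.
increments = [1.2, 1.4, 1.6, 1.8]
proportions = [0.2, 0.4, 0.6, 0.8]
chi_linear, chi_ogive = compare_fits(increments, proportions, trials=100)
print("chi square, straight line:", chi_linear)
print("chi square, normal ogive: ", chi_ogive)
```

Because the two fits in this sketch minimize different quantities, the resulting chi-square values are not strictly comparable; the sketch shows the mechanics of the comparison only.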

The second major prediction of 
the quantum theory is that the small- 
est stimulus-increment at which 100 
per cent discrimination occurs will 
be twice as large as that at which 0 
per cent discrimination occurs. This 
holds only when an observer has 
adopted a “‘two-quanta”’ criterion of 
judgment.⁹ However, regardless of
the judgmental criterion adopted,
the quantal index may be defined in
the general case as follows:

$$ QI = \frac{\Delta S_1}{\Delta S_1 - \Delta S_0}, \qquad [5] $$

where QI is the quantal index, or pre-
dicted ratio; ΔS₁ is the size of the
smallest stimulus-increment at which
100 per cent discrimination occurs;
and ΔS₀ is the largest stimulus-incre-
ment at which 0 per cent discrimina-
tion occurs. Under a “two-quanta”
criterion, ΔS₁ always excites two ad-
ditional quanta, and ΔS₀ always ex-
cites one additional quantum.¹⁰ Since
the quantal units are assumed to be
equal, QI will equal two. For any
quantal criterion adopted, Equation
[5] will yield an integral value of QI.

9 While this is the usual criterion adopted
by an observer, it is possible for the observer
to require that three additional quanta be
excited by the stimulus-increment for dis-
crimination to occur. This corresponds to a
“three-quanta” criterion of judgment. No
discriminations will occur until the increment
is sufficient to excite two additional quanta;
discriminations will occur 100 per cent of the
time when the increment is sufficient to excite
three additional quanta. Since this condition
does not alter the basic formulation of the
quantum theory, it will not be treated inde-
pendently in the present paper.

In the computation of QI, the val-
ues of the stimulus-increments to be
used in Equation [5] are obtained
by solving, algebraically or graphi- 
cally, for the 100 per cent and 0 per 
cent discrimination points in the lin- 
ear functions fitted to the experi- 
mental data. 
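
The computation of the quantal index from a fitted line may be sketched as follows. The Python function below is an illustration, not a published procedure: it assumes that the fitted function has the form P = intercept + slope × ΔS, with P in per cent, and solves algebraically for the 0 per cent and 100 per cent points before applying Equation [5]. The function and parameter names are introduced here for the example.

```python
def quantal_index(intercept, slope):
    """Quantal index from a fitted line P = intercept + slope * increment.

    P is in per cent. The 0 per cent and 100 per cent points of the line
    are found algebraically and inserted in Equation [5].
    """
    if slope <= 0:
        raise ValueError("the fitted function must rise with increment size")
    increment_at_0 = (0.0 - intercept) / slope      # largest increment giving 0 per cent
    increment_at_100 = (100.0 - intercept) / slope  # smallest increment giving 100 per cent
    return increment_at_100 / (increment_at_100 - increment_at_0)


if __name__ == "__main__":
    # Hypothetical two-quanta line for a quantum of 2.5 units:
    # P = (increment / 2.5 - 1) * 100 = -100 + 40 * increment
    print(quantal_index(-100.0, 40.0))
```

For a line that rises from 0 per cent at one quantum to 100 per cent at two quanta the index is two; a line with the same slope passing through the origin yields an index of one.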


SOME REQUIREMENTS OF THE 
QUANTAL METHOD 


The demonstration of neural quanta 
apparently depends upon very rigor- 
ous experimental controls. If the rel- 
atively large, momentary fluctuations 
in over-all organismic sensitivity are
not to obscure the “true” nature
of the discriminatory process, cer-
tain precautions must be taken. As
stated by Stevens, Morgan, and
Volkmann (38, p. 319), “we must
add ΔI instantaneously, and remove
it before the organism is able to
change in sensitivity by more than a
negligible amount.” This require-
ment dictates certain experimental
procedures: (a) there must be no
time interval between the presenta-
tion of the standard stimulus and
variable stimulus, and (b) the varia-
ble stimulus must be of very short
duration. If these conditions are not
satisfied, the random fluctuations in
over-all sensitivity may be expected
to result in nonrectilinear psycho-
metric functions (38).

10 In this restricted case, ΔS₀ is the equiva-
lent of ΔSq as previously defined, i.e., it de-
notes the size of the neural quantum in terms
of the stimulus increment which, under a “one-
quantum” criterion, would yield 100 per cent
discrimination.

Other requirements have also been 
specified by Stevens, Morgan, and 
Volkmann (38). (a) The observer 
must experience little difficulty in 
making discriminatory judgments. 
This presupposes that the observer is 
well trained and that the experi- 
mental situation maximally aids the 
focusing of attention and the stabil- 
ization of judgment criteria. (b) The 
judgments must be made rapidly 
enough to eliminate the need for 
averaging results from different ex- 
perimental sessions. If possible, all 
judgments should be made in a single 
session, thus minimizing the effects 
of temporal variations. (c) Some ob-
servers may be aided in directing
their attention by introducing a
“warning” signal, such as a dim
light,¹¹ at the proper moments in the
test trials. This technique enables
the observer to adjust to the series of
stimulus presentations and serves to
reduce the fatigue of sustained atten-
tion. (d) No transient sounds must
be introduced in the transitions be-
tween the standard stimulus and the
comparison stimulus. If such sounds
are present, they may be used as ex-
traneous cues and will tend to distort
the resulting psychometric function.


EXPERIMENTAL TESTS AND SOME
CRITICAL COMMENTS¹²


11 Some observers, however, find the light
distracting and prefer to make judgments
without this auxiliary cue (32, 38).

12 These comments are not to be construed
as criticisms of individual authors or jour-
nals, but are intended to point out some
aspects of the experimental findings or of data
analysis which may aid in appraising the
present status of the neural quantum theory.

Following the earlier work of
Békésy (1, 2), Stevens and Volk-
mann (37) tested the hypothesis that
loudness discrimination data could
be adequately represented by a linear
function in accordance with the re- 
quirements of the quantum theory. 
The experimental techniques em- 
ployed the precautions already out- 
lined in the preceding section of this 
paper. A single trained observer was 
used at a frequency of 100 cps pre- 
sented at five (20, 30, 50, 60, and 80 
db) sensation levels (SL, db above 
threshold). At each SL, the observer 
listened to a continuous tone whose 
intensity was increased for 0.15 sec. 
at 3 sec. intervals. The task of the 
observer was simply to press a key
whenever an increment was heard. 
Each increment was presented be- 
tween 50 and 100 times in random 
blocks of 25 presentations each. The 
obtained percentages of perceived 
judgments ranged from 0 per cent 
to 100 per cent. 

Since the obtained psychometric 
functions showed the predicted recti- 
linearity and the two-to-one integral 
relation, it was concluded that the 
data supported the quantum theory 
of discrimination. However, two fea- 
tures of the data analysis should be 
considered: (a) the two-to-one inte-
gral relation was obtained on the
basis of visually fitted psychometric
functions, and (b) no tests of goodness
of fit of these functions were re-
ported.

Stevens, Morgan, and Volkmann
(38) later extended the preceding
study, using six trained observers in
pitch discrimination. The procedure
employed was essentially the same as
in the case of loudness discrimination.
A total of 100 judgments was made
by each subject at each of several
(eight to ten) frequency increments,
all of which were less than 10 cps.
The standard stimulus was a 1,000
cps tone at 54 db SL presented in
random blocks of 25 trials each.
Functions were also obtained for a
single observer at five sensation levels
(16, 25, 46, 64, and 90 db), and for
another observer at four sensation
levels (25, 30, 54, and 80 db).

In the treatment of data, linear 
functions were fitted to the experi- 
mental values for pitch discrimina- 
tion by the method of least squares 
and phi-gamma functions were fitted 
to the same values by Boring’s (6) 
method.¹³ For purposes of curve fit-
ting, Δf was taken as the independent
variable and all points falling below 
3 per cent and above 97 per cent were 
omitted in computing the constants 
of the fitted functions. For each of 
the fitted functions, a chi-square test 
of goodness of fit was applied and the 
corresponding P value was deter- 
mined. For both types of functions, 
the number of degrees of freedom 
was taken to be two less than the 
number of points to be fitted. The
results of this analysis showed that
in 14 of the 15 sets of data, the P
values were higher for the rectilinear
functions than for the phi-functions
of gamma. In general, the P values
for the functions predicted by the 
quantum theory were above 0.5, 
whereas those for the ‘‘classic’’ theory 
were less than 0.5. Furthermore, the 
two-to-one integral relation was found 
to hold rather well in most of the 15 
sets of data, but the values ranged 
from 1.89 to 2.34. 

13 This method utilizes the weighting of
observations according to Urban’s tables (22,
44) which contain the products of the Müller
weights and Urban weights as they are
generally applied in the constant methods.
The Müller weights are intended to equalize
the effect of the various proportions of judg-
ments upon the determination of the various
corresponding values of gamma in fitting ob-
served data to the phi-gamma function; the
Urban weights are intended to place greater
emphasis on the more reliable observations in
solving for the constants of the phi-gamma
function by the method of least squares.

While these data obviously favor
the quantum theory, a re-examina-
tion of Table I in Stevens, Morgan,
and Volkmann (38, p. 329) shows 
that the phi-function of gamma, de- 
spite yielding generally lower P 
values, is considered unacceptable in 
only one of the 15 fits according to 
Culler’s (12) interpretation. Nine 
of the phi-gamma functions have a fit 
described as ‘‘good” or better, com- 
pared to 14 rectilinear functions in 
the same classification. It would ap- 
pear, therefore, that on the basis of 
the individual chi-square values ob- 
tained in this study both hypotheses 
remain tenable. 

In an attempt to demonstrate a 
more decisive difference between the 
goodness of fit for the classical and 
quantal hypotheses, a composite of 
the individual P values was also 
computed by Stevens, Morgan, and 
Volkmann (38). Since the composite 
P value for all 15 sets of data taken 
together was 0.931 when the rectilin- 
ear functions were fitted and only 
0.008 when the phi-gamma functions 
were fitted, the quantal hypothesis 
was considered supported and the 
classical hypothesis was considered 
quite unacceptable. 

Flynn (17), however, has made 
three criticisms of the treatment of 
data in the Stevens, Morgan, and 
Volkmann (38) study: (a) disregard-
ing those points below 3 per cent and
above 97 per cent was considered unjustifi-
able when the fitting was done to
compare rectilinear and phi-gamma
hypotheses since the critical aspects
of this comparison involve these ex-
treme values, (b) although the ob-
servations were weighted for relia-
bility by Urban’s¹⁴ method in deter-
mining the best-fitting normal ogives,
no weighting was reported in fitting
the rectilinear functions, and (c) n−3
degrees of freedom (df) should have
been used for the ogive fit. Flynn
(17) concludes, nevertheless, that
the general finding is probably cor-
rect that 14 out of the 15 sets of data
are fitted better by a rectilinear func-
tion than by a phi-gamma function.

14 See footnote 13.

Lewis and Burke (27) have also 
pointed out certain weaknesses in 
the application of the chi-square test 
to the Stevens, Morgan, and Volk- 
mann (38) data. (a) In comparing 
the goodness of fit of the two differ- 
ent functions, the same quantity was 
not minimized in the process of ob- 
taining the constants for the fitted 
functions. In fitting the linear func- 
tions, the sum of squared differences 
between observed and theoretical
proportions was minimized; but, in
fitting the phi-gamma functions, the
sum of the squared differences be-
tween observed and theoretical val-
ues of gamma was minimized. This
would tend to yield chi-square val-
ues for the phi-gamma functions that
were inexact and probably inflated
by an unknown amount. (b) In the
analysis of the pitch discrimination
data for the six observers at 1,000
cps, 54 db SL, four extreme empirical
proportions were excluded in de-
termining the constants of the fitted
phi-gamma functions, but were in-
cluded in the calculations of chi
square for individual observers. This
procedure would also tend to inflate
the composite value of chi square. (c)
There were seven theoretical propor-
tions representing small theoretical
frequencies (less than 10) which
should preferably have been com-
bined with adjacent proportions.
When these factors were considered
in the calculation of new values of
chi square for the phi-gamma func-
tions of the six observers, the com-
posite chi square was 11.76, with 10
df, as compared to the original value
of 33.80, with 21 df.¹⁵ Since the re-
calculated value of chi square falls at
about the 30 per cent level of confi-
dence, the phi-gamma hypothesis
cannot be rejected.
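
The level quoted for the recalculated chi square can be checked directly. The Python sketch below uses only the standard library and evaluates the upper-tail probability of chi square for an integral number of degrees of freedom by the usual closed-form series. The function name is introduced here for the example, and the sketch is offered as a check on the arithmetic rather than as the method used by Lewis and Burke.

```python
import math


def chi_square_upper_tail(x, df):
    """Probability of a chi square at least as large as x with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^i / i! for i = 0 .. df/2 - 1
        term = math.exp(-half)
        total = term
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return total
    # Odd df: start from the df = 1 tail, then add one term per step of 2 df.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)  # term added for df = 3
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total


if __name__ == "__main__":
    print(round(chi_square_upper_tail(11.76, 10), 3))  # recalculated value
    print(round(chi_square_upper_tail(33.80, 21), 3))  # original value
```

For 11.76 with 10 df the probability is approximately 0.30, in agreement with the level cited; the original value of 33.80 with 21 df lies between the 0.05 and 0.025 levels.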

In a study on pitch discrimination, 
Flynn (17) used three trained ob- 
servers to listen to a continuous 1,000 
cps tone at a 55 db SL. The tone, 
lasting 1.25 min. per block of 25 
trials, periodically changed in fre- 
quency for 0.30 sec. The task of the 
observer was to report each time 
this change in frequency was de- 
tected. The number of trials for each 
increment was usually 50 or 100, with 
extremes of 35 and 200. Thirty sets 
of data were obtained from the three 
observers. For each set of data, the 
best-fitting normal ogive and the 
best-fitting straight line were com- 
puted by the method of least squares. 
Precautions were taken to weight the 
empirical proportions in fitting both 
of these functions so that any ob- 
served differences in goodness of fit 
could not be attributed to differences 
in fitting techniques.¹⁶

On the basis of chi-square tests of
goodness of fit as adapted by Thom-
son (40),¹⁷ 16 sets of data were found
to fit neither a straight line nor a
normal ogive. Four sets of data were
found to fit a normal ogive better
than a straight line, while the re-
maining ten sets of data could be
fitted better with a straight line than
with a normal ogive. However, some
of the weaknesses in data analysis
mentioned by Lewis and Burke (27)
were also evident in this study, e.g.,
the 0 per cent and 100 per cent points
were omitted in determining the best-
fitting curves but were reintroduced
in testing for goodness of fit. In
addition, the two-to-one criterion de-
manded by the quantum theory did
not hold in those cases in which linear
functions were obtained. Thus, it
would appear that the evidence of
this study, contrary to Flynn’s (17)
interpretation, may be considered as
failing to support the neural quan-
tum theory since both predictions
were not met.

15 The computations for the chi-square
values of the linear functions had the same
weaknesses as those outlined for the phi-
gamma functions, except that none of the
empirical proportions omitted during the
curve-fitting process were later included in the
tests of goodness of fit. Lewis and Burke (27)
also mention the fact that the linear functions
were not obtained from weighted proportions.

16 For the normal ogive the Müller-Urban
weights, corrected for the number of obser-
vations, were used; for the straight line, only
the Urban weights were needed and applied.

17 There were two exceptions to Thomson’s
(40) procedure: (a) no attempt was made to
compare chi-square values by means of the
standard error, and (b) the number of df for
the ogive was the number of percentages
minus 3; for the straight line, minus 2.

Koester and Schoenfeld (26), while 
comparing the relative merits of 
quantal and nonquantal procedures
in pitch discrimination, were also
interested in duplicating the quantal
findings. In the quantal procedure,
two highly trained observers were 
presented with a moderately loud 
1,000 cps tone. The standard was 
given for 1.0 sec. followed without 
interruption by an increment of the 
same duration. Twenty standard- 
increment pairs and two pairs in 
which the standard continued for 2.0 
sec. comprised a series of trials. The 
same increment was used throughout 
a series. The task of the observer was 
to report each time an increment was 
heard. 

For each observer, a complete set 
of psychometric data was obtained 
by the quantal method on each of 
four days. It was concluded that 
none of the data exhibited either the 
rectilinearity or the integral relation 
predicted by the quantum theory. 
However, no mention was made of
fitted functions or tests of goodness 
of fit. Also, since each of the eight 
psychometric functions was based on 
either six or seven points with only 
twenty observations per point, the 
conclusions of this study should be 
accepted with caution. 

In a study designed to investigate 
some of the factors which might ob- 
scure the quantal nature of the dis- 
criminatory mechanism, Miller and 
Garner (32) obtained intensity dis- 
crimination data for a 1,000 cps tone 
at 40 db SL on two observers using 
both the standard quantal procedure 
of Stevens, Morgan, and Volkmann 
(38) and a modified quantal pro- 
cedure. In the modified procedure, 
the stimulus-increment was altered 
at random after each presentation 
and the observer was not permitted 
to stop after every 25 presentations. 
This modification was introduced to
prevent the observer from establish-
ing a fixed “two-quanta” criterion
of judgment.

The results obtained by the stand-
ard quantal method showed that the
two psychometric functions could be
adequately represented by a straight
line fitted by the method of least
squares and that the predicted inte-
gral relation was closely approxi-
mated. For the modified procedure,
the phi-gamma functions were fitted
by the technique proposed by Guil-
ford (22); the quantal hypothesis
was evaluated by fitting a series of
three straight lines to the empirical
values lying between the successive
quantal points as determined by the
previously-administered standard
method. On the basis of chi-square
tests of goodness of fit, it was con-
cluded that the quantal hypothesis
provided a better description of the
data than did the phi-gamma hy-
pothesis.

While these findings are indeed
significant, two aspects of data an- 
alysis should be pointed out: (a) the 
question arises as to whether data as- 
sumed to have been obtained under 
three different certainty criteria and 
fitted by three different linear func- 
tions can be validly “connected” to 
yield a single psychometric function; 
and (b) assuming that a single func- 
tion is valid, each of the straight lines 
at the two extremes of the function 
must be made to intersect an addi- 
tional horizontally-linear segment if 
the function is not to predict impossi- 
ble response values greater than 100 
per cent and less than 0 per cent. 

In further analyzing the data of 
Stevens and Volkmann (37) and of 
Flynn (17), Miller and Garner (32)
proceeded to show that (a) the pro-
posed technique of fitting three linear
segments to psychometric functions
holds in general for those cases in-
volving criterion shifts by the ob-
server and is not limited to loudness
discrimination or to the random
method of stimulus presentation,
and that (b) combining of data ob-
tained either under different experi-
mental conditions or from different
observers tended to yield psycho-
metric functions in accordance with
the phi-gamma hypothesis. Thus,
the work of Miller and Garner (32)
tends to support the quantum theory
and serves to isolate some of the
factors responsible for nonquantal
findings.

In a fairly extensive study, Corso 
(10) recently attempted to test the 
hypothesis that the data obtained in 
the auditory discrimination of fre-
quency and intensity satisfied the
conditions predicted by the theory 
of the neural quantum. For intensity 
discrimination, the general procedure 
followed that of Stevens and Volk-
mann (37); for frequency discrimina-
tion, that of Stevens, Morgan, and
Volkmann (38). In all, 20 subjects 
were used in the study, after having 
been screened from a larger group of 
45 by means of an audiometric test 
and the Seashore pitch and loudness 
tests. Each subject was given two 
separate practice hours (for a total 
of 425 to 1,225 judgments) under the 
specific conditions of the test trials. 
Five subjects were tested under each 
of the following conditions of fre- 
quency discrimination: (a) 1,000 cps 
at 20, 40, 60, and 80 db SL, and (b)
300, 1,000, and 3,000 cps at 60 db 
SL. A similar pattern was followed 
for intensity discrimination. For 
each test condition, at least six stimu- 
lus increments were presented, with 
approximately 200 judgments being 
made at each increment-value. 

In the analysis of data, linear func- 
tions were fitted to the individual 
sets of frequency and intensity dis-
crimination data by the method of
least squares, with all empirical pro-
portions greater than 0.97 or less
than 0.03 omitted. The chi-square
test was used to test the goodness of 
fit of each obtained function. In the 
calculation of chi-square values, all 
theoretical proportions greater than 
0.97 and less than 0.03 were appro- 
priately combined with adjacent pro- 
portions. This technique insured that 
in calculating the chi-square values 
(a) no proportions representing the-
oretical frequencies of less than five
were used, and (b) no empirical pro-
portions were used which did not
enter into the solutions of the param- 
eters of the linear functions. Of the 
70 chi-square values computed, only 
nine (seven in frequency discrimina- 
tion, two in intensity discrimination) 
had a P value equal to or greater 
than 0.05. Of the nine psychometric 
functions in which the hypothesis of 
linearity was retained, only one had 
a ratio-value which approached the
predicted two-to-one criterion. It 
was concluded, therefore, that the ex- 
perimental results in frequency and 
in intensity discrimination failed to 
satisfy the predictions of the theory 
of the neural quantum. 

Licklider (29), in reviewing Corso’s 
(10) study, pointed out that failing 
to obtain psychometric functions 
which conform to quantal predictions 
can only disprove the quantum 
theory if “all the non-physiological 
error variance” has been eliminated
from the experimental measurements 
and the observer is worked at his 
“physiological limit.” Obviously, 
the existence or nonexistence of these 
qualifying conditions can (perhaps) 
never be known, but only assumed 
from the obtained data. One would 
expect, however, that in two essen- 
tially identical experiments such a 
source of error would be roughly 
equivalent, unless some unusual (and
perhaps drastic) precautions were
taken in one experiment and not the
other.

The theory of the neural quantum
has been extended to include the
problem of sensory discrimination in 
areas other than audition. Jerome 
(25) obtained olfactory psychometric 
functions using stimulus pressure, as 
measured by an Elsberg olfactometer, 
as the independent variable. In the 
discrimination tests, one trained and 
one untrained observer, after becom- 
ing acquainted with the odor of citral, 
were instructed to indicate by their 
responses whether or not the odor 
was present when the stimulus was 
delivered. The task was presented 
as one of distinguishing between the 
stimuli from a control bottle and 
those from a citral bottle. There were 
ten presentations of the stimulus 
from each bottle at each of several 
(seven to nine) pressure values. 
Seven psychometric functions were
obtained from the data of the two ob-
servers in the discrimination experi-
ment, and six psychometric functions
were obtained on two observers in the
preliminary test of instructions. Lin-
ear functions were fitted to each of
these 13 sets of data by the method
of averages. The results of this an-
alysis showed that the criterion of
rectilinearity was satisfactorily met;¹⁸
but, since the additional criterion of 
the integral relation was not met, it 
was concluded that the existence of 
a differential olfactory quantum was 
not demonstrated. 

There are three apparent weak- 
nesses in Jerome’s (25) study if the 
data are to be used to evaluate the 
quantum theory: (a) a nonstandard
quantal procedure was used inasmuch
as an interval of 30 sec. was permitted
to elapse between stimulus presenta-
tions to avoid olfactory fatigue; (b)
only ten observations were made at
the critical values at the extremes of
the psychometric functions;¹⁹ and
(c) no tests of goodness of fit were 
reported, presumably due to the 
presence of small theoretical fre- 
quencies which precluded the use of 
chi square. Thus, it appears that 
the data of this study cannot form an 
adequate basis either for the accept- 
ance or for the rejection of the quan- 
tum theory. 

DeCillis (14) attempted to follow 
the quantal procedure in finding the 
relation between amplitude of stimu- 
lus movement over a cutaneous area 
and frequency of positive response. 


18 Since no tests of goodness of fit were re- 
ported, it is presumed that rectilinearity was 
determined by visual inspection of the fitted 
functions. 

19 For example, of the 13 functions obtained 
five had no observations at stimulus values 
yielding between 80 per cent and 100 per cent 
response; seven had no observations between 
0 per cent and 20 per cent response. 
The procedure employed was to pre-
sent a fine air column, at a pressure
of 35 lbs. per sq. in., at a point on
the skin for 0.10 sec. The air column 
then traveled across the skin at a 
rate of 143 mm./sec. to another 
point where it was again stationary 
for at least 0.10 sec. After the air 
column was turned off, the needle 
controlling the stimulus returned to 
its starting position. This procedure 
was repeated 20 times in a series, 
with the same amplitude of move- 
ment presented on each trial. Three 
subjects were used and sensitivity
was measured on the fingertip, arm, 
and leg. The task of the observer
was to report “yes” whenever move-
ment of the air column was per-
ceived and “no” whenever it was
not.

The best-fitting straight lines were 
calculated by the method of least 
squares for 35 selected psychometric 
functions, with the 0 per cent and 100
per cent points omitted in the curve-
fitting process. The chi-square test
for goodness of fit was applied fol- 
lowing the method of Brown and 
Thompson (9). No attempt was 
made to fit 16 sets of nonhomogene-
ous data. The results of this analy-
sis yielded 20 chi-square values with
probability values equal to or greater 
than 0.95, while 15 values were 
smaller than this. It was concluded, 
therefore, that it was not “unrea-
sonable to maintain that the best-
fitting psychometric function is recti-
linear” (14, p. 47). However, in
those cases where the hypothesis of 
linearity was retained, the criterion 
of the integral relation did not hold. 
DeCillis (14, p. 49) contends that 
“apparently the integral relation is 
not to be expected in studies of ab- 
solute sensitivity.” 

It is unfortunate that in this study 
(14) extensive data were not col-
lected at those points on the psycho-
metric functions where maximal dif-
ferences between the phi-gamma and
quantal hypotheses were to be ex- 
pected. In the 20 out of 35 fitted 
functions in which the linearity hy- 
pothesis was retained, 19 functions 
had no observations at stimulus val- 
ues yielding between 0 per cent and 
15 per cent (or more) responses; 10 
functions had no observations be- 
tween 85 per cent (or less) and 100 
per cent responses. Thus, for a given 
set of data, since the 0 per cent and 
100 per cent points were also omitted 
in the curve-fitting process, the re- 
maining empirical points would prob- 
ably not have deviated from a 
straight line, whether or not the 
‘true’ function were ogival or linear. 
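
The force of this objection can be illustrated numerically. The sketch below is hypothetical throughout: it takes error-free proportions from a normal ogive (the phi-gamma hypothesis), fits a straight line by least squares, and computes a chi-square for the departure of the ogive proportions from the line. The ogive parameters, the stimulus values, and the 20 trials per point are assumptions chosen for illustration, not data from the study.

```python
from statistics import NormalDist


def least_squares_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def chi_square_against_line(x, p_obs, n_trials):
    """Chi-square of observed proportions against a fitted straight line.

    Each point contributes n * (p - p_fit)^2 / (p_fit * (1 - p_fit)),
    which combines the "yes" and "no" cells. Fitted proportions must lie
    strictly between 0 and 1 for the statistic to be defined.
    """
    intercept, slope = least_squares_line(x, p_obs)
    chi2 = 0.0
    for xi, pi in zip(x, p_obs):
        p_fit = intercept + slope * xi
        chi2 += n_trials * (pi - p_fit) ** 2 / (p_fit * (1.0 - p_fit))
    return chi2


ogive = NormalDist(mu=5.0, sigma=2.0)            # assumed "true" ogive
middle = [4.0, 4.5, 5.0, 5.5, 6.0]               # roughly 31 to 69 per cent
wider = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]      # roughly 7 to 93 per cent
chi2_middle = chi_square_against_line(middle, [ogive.cdf(v) for v in middle], 20)
chi2_wider = chi_square_against_line(wider, [ogive.cdf(v) for v in wider], 20)
```

Under these assumptions the chi-square for the middle range is close to zero even though the generating function is an ogive, and it grows only when points nearer the extremes are included. A straight line fitted to the central portion of a function therefore cannot discriminate between the two hypotheses.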

In the most recent attempt to eval- 
uate the quantal hypothesis, Black- 
well (4) obtained visual discrimina- 
tion data for four observers using 
normal binocular viewing and natural 
pupils at a luminance of 4.71 foot- 
lamberts. The stimulus was a circu- 
lar luminance-increment, subtending 
18.5’ located 7° to the right of the 
fixation spot and was presented for 
a duration of 0.06 sec. once every 
12.25 sec. Each psychometric func-
tion was based on 14 to 18 incre-
ment-values with 20 observations at
each point. The increments were
presented in random blocks of 20
trials each and the same increment
was used throughout all the trials of
a block. The observer indicated dis-
crimination by responding “yes” or
“no” to each presentation of the
stimulus.

In the data analysis, a linear func-
tion was fitted to a selected set of
data for each observer.²⁰ The selec-
tion of a specific set of data from


20 The exact method of curve fitting is not 
specified in Blackwell (4), but the use of 
probit analysis is implied. 
among the total number of sets avail-
able (Observer 1: 4 sets; 2, 18 sets;
3, 24 sets; and 4, 10 sets) was carried 
out in such a way as to obtain the 
function most adequately fitted by a 
quantal curve. The results of this 
analysis revealed that the data of 
two observers could not be used to 
evaluate the quantal hypothesis. 
For one of these subjects, the stimuli 
were spaced too widely over the criti- 
cal psychometric range; for the other, 
the data were extremely scattered. 
The data of the remaining two sub- 
jects were fitted adequately in most 
cases by either a “two-quanta” or a 
“three-quanta”’ curve. 

Blackwell (4), however, considers 
this apparent confirmation of the 
theory as spurious. This assertion 
is based upon the fact that, as the 
criterion of judgment increases from 
two quanta to three quanta, the 50 
per cent threshold decreases rather 
than increases as expected from an 
extension of the quantum theory. 
The hypothesis is advanced that 
“response channelization” (5) may 
actually be responsible for the fact 
that some experimental data appear 
to conform to quantal rectilinearity. 
It is concluded (4) that the visual 
threshold-data obtained by the stand-
ard quantal procedure do not confirm
the predictions of the neural quantum 
theory. 


DISCUSSION 


While the predictions of the theory 
of the neural quantum are specific: 
(a) rectilinearity of the psychometric 
function, and (b) an integral relation
between the values of the stimulus- 
increments at the 100 per cent and 
0 per cent response points on the psy- 
chometric function, the experimental 
task to evaluate these predictions is 
extremely difficult. As Blackwell 
(4, p. 398) states, “Essentially, the
quantum theorists have so restricted
the allowable conditions of measure-
ment and the analysis of data that it
is difficult to obtain an unambiguous
evaluation of the theory.” Licklider
(29, p. 99) has aptly summarized the 
difficulties to be encountered in at- 
tempting to obtain negative evidence 
on the theory of the neural quantum 
by saying that “it is a shame that 
the quantum theory has such strong 
built-in self-protection.” 
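
The second prediction, the integral relation, can be checked from any fitted straight line. The sketch below rests on an assumption that should be stated plainly: for an n-quanta criterion, the rectilinear function is taken to rise from 0 per cent at (n - 1) quanta to 100 per cent at n quanta, so that the ratio of the two intercepts is n / (n - 1), i.e., 2:1 for a “two-quanta” curve and 3:2 for a “three-quanta” curve. The function names and the 10 per cent tolerance are hypothetical.

```python
def response_intercepts(intercept, slope):
    """Stimulus increments at which the line p = intercept + slope * x
    reaches 0 per cent (p = 0) and 100 per cent (p = 1) response."""
    x_zero = -intercept / slope
    x_full = (1.0 - intercept) / slope
    return x_zero, x_full


def predicted_ratio(n_quanta):
    """Assumed quantal prediction for x_full / x_zero."""
    if n_quanta < 2:
        raise ValueError("the criterion must be at least two quanta")
    return n_quanta / (n_quanta - 1)


def integral_relation_holds(intercept, slope, n_quanta=2, tolerance=0.10):
    """True if the observed intercept ratio is within a relative tolerance
    of the assumed prediction. The tolerance is arbitrary."""
    x_zero, x_full = response_intercepts(intercept, slope)
    if x_zero <= 0:
        return False                   # ratio undefined for this line
    expected = predicted_ratio(n_quanta)
    return abs(x_full / x_zero - expected) <= tolerance * expected


# Hypothetical line rising from 0 per cent at x = 1 to 100 per cent at x = 2.
holds_for_two = integral_relation_holds(-1.0, 1.0, n_quanta=2)
holds_for_three = integral_relation_holds(-1.0, 1.0, n_quanta=3)
```

A function can thus satisfy rectilinearity while failing this check, which is the pattern reported in several of the studies reviewed above.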

In the first place, the applicability 
of the theory is restricted to data col- 
lected under one psychophysical pro- 
cedure: the quantal method (21). 
This is unlike the usual approach to 
psychophysical research where one 
or more methods may be appropriate 
for the investigation of a given prob- 
lem. Within the quantal method, 
Blackwell (4) objects to the “phe-
nomenal report” as the only “indi-
cator-response” and cites data (3) to
support his contention that the
“forced-choice” technique²¹ is more
adequate than the phenomenal report
in threshold measurements under
routine conditions. It is maintained
that the use of “forced-choice” as
the “indicator-response” tends to
minimize session-to-session variabil-
ity when practiced observers are
used.

A second restriction placed on the
data-collection process is that stimu-
lus increments must be grouped into
blocks of presentations of the same
magnitude. Miller and Garner (32)
have demonstrated clearly that the
random ordering of stimulus magni-
tudes prevents even the well-trained


21 The “forced-choice” technique is defined
by two conditions: (a) the observer “is re-
quired to indicate discrimination by correctly
identifying some verifiable attribute of the
stimulus such as its spatial location or tem-
poral interval; and (b) he is required to select
an answer on each stimulus-presentation— 
even if he has to guess” (4, p. 398). 
observer from adopting a stable cri-
terion, but the interpretation of the
single(?) function obtained is not too
clear. Blackwell (4) argues that the
nonrandom block presentations pro-
vide the observer with the oppor-
tunity to respond in an invalid man-
ner. It is maintained that the pres-
ence of “positive response channel-
ization” and of “negative response
channelization” will operate to dis-
tort threshold data into a form re-
sembling that required by the quan-
tum theory.²² Senders and Sowards
(36), in a study in which judgments
were made of the simultaneity of
presentation of a light and a tone,
also found that successive presenta-
tions of the stimulus near threshold
tended to produce long series of iden-
tical responses.

Koester and Schoenfeld (26, p. 11)
have a view somewhat similar to
Blackwell's (4), contending that in
the course of prolonged practice the
observers may “learn to adjust and
cut-and-fit their certainty criteria to
the several values of the comparison
stimuli in such a way as to yield the
necessary rectilinearity in the psy-
chometric function and the integral
relation.”

Osgood (35, p. 64) raises the same 
question on methodology by asking: 
“if the subject knows that all incre- 
ments in a given series are going to 
be identical, is it not possible for him
to set up a subjective standard on the
basis of which he graduates the fre-
quency of his responses?” Some evi-
dence on this point is reported in the


22 “Positive response channelization” is de-
fined as an increase in the number of per-
ceived judgments toward the ends of the
blocks of stimuli for which the predominant 
response was positive. “Negative response 
channelization” is defined as an increase in the 
number of nonperceived judgments toward 
the ends of the blocks of stimuli for which the 
predominant response was negative. 
study of Senders and Sowards (36) in
which it was found that the observers 
tended to adjust their proportion of 
responses in accordance with their 
expectations. 

A third restriction which must op-
erate in quantal studies is that only
data collected in a single experi-
mental session may be used to test
the quantal predictions. Miller and
Garner (32) have demonstrated that
two sets of data obtained from the
same observer by the same procedure
but at different times will average
into a typical sigmoidal distribution,
even though the separate functions
are rectilinear. Nevertheless, such a
restriction makes it practically im-
possible to establish the adequacy of
the quantum theory with any high
degree of confidence. Through a
special application of the chi-square
test, Blackwell (4) has shown that no
matter how many sets of experi-
mental data are available, at least 40
presentations must be made in each
experimental session at each of 10
stimulus-values if the normal ogive
is to fail to fit the data, even though
the data may actually conform to the
specific requirements of the quantum
theory. In the studies reviewed in
the present paper, most functions
were based on fewer than 10 stimulus-
values each and many functions uti-
lized 25 or fewer observations per ex-
perimental point. Apparently, under
these conditions, unequivocal results
could not have been expected. An
additional comment is made by Lewis
and Burke (27) who indicate that,
in data analysis, the elimination of
extreme proportions makes it impos-
sible to obtain a critical evaluation
of the phi-gamma hypothesis through
the use of the chi-square test of good-
ness of fit.

There is some evidence to show
that, perhaps, the restriction on the
combining or averaging of experi-
mental data is unjustified. Koester
and Schoenfeld (26) have shown 
that threshold data remain fairly 
consistent from day to day; Corso 
(10) found no differences among data 
collected in two successive hours; 
Myers and Harris (34) have reported 
that the fluctuation of the auditory 
threshold is less than approximately
1 db for relatively short periods of 
time. It should be recalled, however, 
that the assumption of momentary 
and random changes in the over-all 
sensitivity of the observer is funda- 
mental to the derivation of the quan- 
tum theory. Thus, any compromise 
on this restriction would undoubtedly 
necessitate a major revision of the 
theory. Nevertheless, since the in- 
tegral relation predicted on the as- 
sumption of relatively large fluctua- 
tions in sensitivity has seldom been 
obtained, it may be that in the final 
analysis this assumption will prove
to be unwarranted. 
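
The reason the theory forbids averaging across sessions can be shown with a small numerical sketch. It assumes, for illustration only, a two-quanta criterion under which the function rises linearly from 0 per cent at one quantum to 100 per cent at two quanta, and it assumes that the size of the quantum differs between two sessions. The quantum sizes and function names are hypothetical.

```python
def rectilinear(increment, quantum, n_quanta=2):
    """Assumed quantal psychometric function: linear between
    (n_quanta - 1) and n_quanta quanta, clipped to the range 0 to 1."""
    p = increment / quantum - (n_quanta - 1)
    return min(1.0, max(0.0, p))


def averaged(increment, quanta):
    """Mean of the rectilinear functions from several sessions."""
    return sum(rectilinear(increment, q) for q in quanta) / len(quanta)


session_quanta = [1.0, 1.5]            # hypothetical quantum size per session
grid = [1.0, 1.5, 2.0, 3.0]            # break points of the averaged function
values = [averaged(x, session_quanta) for x in grid]
# Slope of the averaged function on each successive segment of the grid.
slopes = [
    (values[i + 1] - values[i]) / (grid[i + 1] - grid[i])
    for i in range(len(grid) - 1)
]
```

Under these assumptions each session gives a single straight line, but the average rises slowly, then steeply, then slowly again. With more sessions the corners multiply and the pooled function approaches an ogive, which is the result attributed above to Miller and Garner (32).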


In addition to the methodological 
problems already mentioned, certain 
other issues have been raised. Osgood 
(35) is disturbed by the observation 
reported in Stevens, Morgan, and 
Volkmann (38, p. 334) that some of
the stimulus-increments are per-
ceived as “larger and plainer than
others. Increments heard 80% of
the time tend to be subjectively
larger than increments heard only
20% of the time.” If, as developed
in the quantum theory, “discrimina-
tion depends upon the addition of
another neural unit to those already 
in operation, how can the same addi- 
tional quantal unit seem smaller 
when it is added only 20% of the 
time as compared with its being 
added 80% of the time?” (35, p. 65). 
While this might well be a critical 
issue for the theory of the neural 
quantum, its resolution is not readily 
apparent, despite Jerome’s (25) con-
tention that evidence was obtained
for the existence of a “magnitude 
quantum” of olfactory sensitivity. 

Wever (46) has taken the position 
that peripheral quanta have not, as 
yet, been demonstrated and are not 
to be expected on the basis of the vol- 
ley theory of hearing. According to 
the volley theory, pitch is considered 
to be a continuous function of volley 
frequency for low and intermediate 
tones, and of spatial pattern for high 
tones; loudness is considered to be a 
continuous function of the magnitude 
of auditory nerve discharge which de- 
pends upon the number of active 
fibers and the rates of fiber activity. 

It should be recalled, however, 
that Stevens, Morgan, and Volk- 
mann (38) hypothesize that (a) the 
neural quantum appears at a central,
not a peripheral, locus, (b) it is func-
tional rather than anatomical, and
(c) it involves a number of fibers
rather than a single fiber. Three argu-
ments are offered to support this
view as opposed to Békésy's (1) con- 
tention that the quantal unit is the 
individual nerve fiber: (a) the quan- 
tum for the individual observer has 
no fixed magnitude, (b) for a given
sensory attribute, the number of 
auditory nerve fibers is greater than 
the number of quanta, and (c) the 
binaural quantum is approximately 
two-thirds the size of the monaural 
quantum. It should also be realized 
that the neural units of Stevens, 
Morgan, and Volkmann (38), whether 
or not substantiated, are hypotheti- 
cal constructs and do not specify the 
neural correlates of sensory attri- 
butes (35). 

One final point remains to be con- 
sidered. The derivation of the theory 
of the neural quantum is based on 
the assumption of two quantitative 
variables: (a) a physical continuum 
and (b) a psychological continuum.²³
The modern theory of psychophysics,
however, assumes three parallel quan-
titative variables: (a) a physical
continuum, (b) a sensory response
continuum, and (c) a judgment con-
tinuum (21). According to this
schema, the notions of neural quanta
are most directly related to contin-
uum (b). However, there is a consid-
erable amount of experimental evi-
dence which shows that the regres-
sion relating the judgment continuum
(c) to the sensory response continuum
(b) is not always linear, and neither
is the correlation always perfect.
Thus, in the psychometric function,
which relates continuum (c) to con-
tinuum (a), it would be possible to
obtain a curve unlike that which re-
lates continuum (b) to continuum
(a). In other words, quantal func-
tioning at continuum (b) might not
be evidenced in a psychometric func-
tion unless a perfect, linear relation-
ship between (c) and (b) could be
obtained under certain conditions of
careful experimental controls, atten-
tive attitude of the observer, stabil-
ized learning, etc. On the other hand,
it is conceivable that quantal func-
tioning might characterize contin-
uum (c) independently of the quantal
or nonquantal character of contin-
uum (b). Thus, even if the existence
of a rectilinear psychometric function
were to be unequivocally established,
the interpretation of the processes
underlying the function would not
be immediately apparent.


SUMMARY AND CONCLUSIONS 


The present paper has attempted 
to fulfill two primary objectives: (a)
to present a complete and detailed

23 The term “continuum” is considered in a
broad sense and permits a quantum theory for
either variable.

account of the theory of the neural
quantum of sensory discrimination,
and (b) to review the literature on the
quantum theory in order to assess 
the current status of the theory in 
the light of the total experimental 
evidence now available. 

It has been shown that from the 
theory of the neural quantum two 
specific hypotheses about the psycho-
metric function may be derived and
tested: (a) that a linear relationship
obtains between the size of stimulus-
increments presented and the per-
centage of responses observed, and
(b) that an integral relation obtains
between the stimulus-increment val-
ues of the function at the 100 per
cent and 0 per cent points-of-re-
sponse. Data in support of these hy-
potheses would indicate that the
fundamental processes involved in
sensory discrimination are discrete or
quantal in character.

While the hypotheses derived from
the quantum theory are experimen-
tally verifiable, severe limitations in
methodology and in statistical treat-
ment of data make it extremely dif-
ficult to evaluate the tenability of
the hypotheses as opposed to the
alternate hypothesis of the phi-gamma
function. However, despite these
limitations, it may be concluded that
in certain investigations rectilinear
psychometric functions have been
obtained. The existence of the inte-
gral relation, contrariwise, has sel-
dom been demonstrated. Thus, when
both factors are considered in the
body of available evidence, it appears
that unequivocal support of the
neural quantum theory is, for the
most part, lacking. In addition, the
validity of judgments obtained under
the experimental conditions of the
quantal method has been seriously
questioned.

The present review of literature on
the quantum theory suggests the
need for future research along two
major lines: (a) the development of
a more satisfactory technique for
statistically testing the goodness of
fit of the quantal and phi-gamma
hypotheses to a set of experimental
data, and (b) the determination of
the specific conditions under which
rectilinear psychometric functions
may be obtained in order to establish
the validity and universality of
quantal notions. Until such research
is carried out, the issue of the neural
quantum theory of sensory discrimi-
nation cannot be fully resolved.


TECHNIQUES FOR COMPUTING SHIFT IN A
SCALE OF ABSOLUTE JUDGMENT¹
KURT SALZINGER
Columbia University²


The method of absolute judgment 
(single stimuli) has a long history. It 
started originally as a psychophysi- 
cal method, i.e., it was applied to 
physical stimuli. Since then, how- 
ever, it has come to be more widely 
applied, i.e., to stimuli which cannot 
easily be ordered on a physical con- 
tinuum. 

McGarvey describes the method 
of single stimuli in the following way: 
“The observer is simply presented 
with the members of a group of stim- 
uli one at a time and asked to render 
a judgment upon each by assigning 
it to one of a specified set of cate-
gories” (7, p. 9). This assignment of
stimuli to categories is sometimes
referred to as a “naming response.”

Investigators using the method of 
absolute judgment have referred to 
the observer's behavior as a forma- 
tion of a frame of reference (see Hel- 
son, 4, for example) in accordance 
with which he responds to each of the 
stimuli which he must judge. Experi- 
menters were also interested in dis- 

1 This article is based on a dissertation sub- 
mitted in partial fulfillment of the require- 
ments for the Ph.D. degree in the Department 
of Psychology in the Faculty of Pure Science, 
Columbia University. The author is indebted 
to Professors J. Zubin, H. Garrett, and C. G. 
Mueller, Dr. S. Kugelmass, and Mrs. S. 
Salzinger for their invaluable aid during the 
various phases of the dissertation on which 
this paper is based. He also wishes to acknowl- 
edge his gratitude to Professor J. Zubin, Dr. 
E. Burdock, Dr. S. Sutton and Mrs. S. Sal- 
zinger for their comments on this paper. The 
preparation of this article was facilitated in 
part by grant M586 from NIMH. 

2 Now at Biometrics Research, N. Y. State 
Dept. of Mental Hygiene, Psychiatric Insti- 
tute, 722 West 168 St., New York 32, N. Y. 
covering how to modify the frame of
reference of an observer. This modifi- 
cation of the frame of reference is 
known as shift and has been brought 
about by, among others, such varia- 
bles as anchor stimuli (stimuli not 
originally included among the stimuli 
presented to S) and social stimuli (the 
judgments rendered by another S).

It may be defined as a systematic 
change in the categories to which the 
observer assigns the members of a 
group of stimuli or as a systematic 
change in the stimulus values to 
which he gives particular naming re- 
sponses. 

Along with the study of the pa- 
rameters affecting absolute judgment 
have come a number of different 
techniques for the statistical treat- 
ment of the data to arrive at a meas- 
ure of shift. In this paper, an at- 
tempt will be made to review the 
techniques used up to the present 
time for computing the amount of 
shift as well as to present two new 
techniques. 

Perhaps the most commonly used
technique for measuring shift is based
upon the use of ratings. In this
method numbers are applied to the
categories either during the experi-
mental situation or afterwards and
these numbers are then treated as
an equal ratio scale. To evaluate
whether a shift in judgment has taken
place from one condition to another,
means, differences between means,
variances, and critical ratios of these
ratings are calculated. The technique
of ratings has been used by, among
others, Helson (4) for judgment of
weights as well as in the field of vi-
sion. Both Heintz (3) and Brown (1)
apparently felt the need for justify- 
ing the use of this method. The 
former did so by finding fault with 
other methods of data treatment (in 
which incidentally he was justified),
and the latter justified the procedure 
by appealing to the fact that Likert 
(5) found high positive correlations 
between scores based upon frequency 
counts of judgments and arbitrary 
ratings applied to the attitude items 
of a questionnaire. Brown ignores 
the fact that Likert used judgments 
of different stimuli, i.e., Brown used 
weights while Likert used verbal 
stimuli. Furthermore, the fact re-
mains clear that this method of treat-
ing data consists of the application
of numbers to events (judgments in
this case) without specifying the
operations in the experimental situa-
tion that would be equivalent to the
operations involved when the num-
bers are combined statistically. It
is here suggested that this method
cannot be used since the operations
performed with the numbers cannot
be performed with the events (judg-
ments) being quantified. In other
words, no evidence is available to
show that a rating of “4” is two times
as great as one of “2,” etc. 
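
For concreteness, the computation that the rating method performs can be sketched as follows. The sketch assumes the independent-samples form of the critical ratio (difference between means divided by the standard error of that difference). The ratings and names are hypothetical. The arithmetic is shown only to make plain which operations are being applied to the category numbers; it does not answer the objection raised above.

```python
from math import sqrt
from statistics import mean, variance


def critical_ratio(ratings_before, ratings_after):
    """Difference between mean ratings divided by its standard error.

    Treats the category numbers as if they were measurements, which is
    the very assumption questioned in the text.
    """
    se = sqrt(
        variance(ratings_before) / len(ratings_before)
        + variance(ratings_after) / len(ratings_after)
    )
    return (mean(ratings_after) - mean(ratings_before)) / se


# Hypothetical category numbers assigned under two conditions.
unanchored = [1, 2, 3, 4, 5]
anchored = [3, 4, 5, 6, 7]
cr = critical_ratio(unanchored, anchored)
```

Every step here (averaging, squaring deviations, dividing) presupposes that equal numerical differences between categories correspond to equal differences between judgments.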

Likert (5) uses ratings indirectly. 
He bases the values he assigns to 
judgments on the following: he as- 
sumes that the judgments are nor- 
mally distributed; then he deter- 
mines the value of each category by 
converting the proportion of Ss giv- 
ing each judgment (or the proportion 
of responses given by one S) to a 
standard score, which in turn is 
based on the assumption that the use 
of standard deviations results in 
equal interval scales of the judgment 
continuum. This method is some- 
what long and necessitates a large 
number of judgments or a large num-
ber of judges to obtain reliable pro-
portions. In cases where one postu- 
lates individual differences, it be- 
comes necessary to make separate 
calculations of the numerical values 
of each category of judgment for each 
S. Such a procedure makes the al- 
ready long procedure longer.  Fur- 
thermore, the assumption of nor- 
mality cannot always be justified or 
met. 
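
One way such a conversion might be carried out is sketched below. It is an illustration under stated assumptions, not a reproduction of Likert's procedure: each category is given the standard score at the cumulative proportion reached at the middle of that category, on the assumption that the judgments are normally distributed. The function name and the counts are hypothetical.

```python
from statistics import NormalDist


def category_standard_scores(counts):
    """Standard score for each ordered category from its frequency.

    Assumes normally distributed judgments and places each category at
    the cumulative proportion of its midpoint. The lowest and highest
    categories must have been used at least once, since a cumulative
    proportion of exactly 0 or 1 has no finite standard score.
    """
    total = sum(counts)
    unit_normal = NormalDist()
    scores = []
    below = 0
    for count in counts:
        midpoint = (below + count / 2) / total
        scores.append(unit_normal.inv_cdf(midpoint))
        below += count
    return scores


# Hypothetical frequencies for three ordered judgment categories.
scores = category_standard_scores([10, 80, 10])
```

The dependence on proportions is what makes the method demanding: with few judgments the proportions, and hence the category values, are unstable, and the whole computation must be repeated for each S if individual differences are assumed.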

A third method which makes use 
of only the assumption of an ordinal 
scale was utilized by Mausner (6). 
He took median values of each S's 
judgments in two different situations. 
He then plotted Ss’ median judg- 
ments for groups of 20 trials, show- 
ing graphically what Ss shifted and 
under what conditions. While he did 
not apply any statistical tests to 
these scores (he did to different types 
of scores to be discussed below), this 
type of data lends itself to easy an- 
alysis by means of nonparametric 
tests like the median test, described 
in Mood (8). The point might be 
made here that the median is not a 
very discriminating measure espe- 
cially, for example, if only three judg- 
ment categories are used. It becomes 
more valuable with an increasingly 
greater number of categories. 
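
The limited discrimination of the median with few categories is easy to demonstrate. In the sketch below, judgments are grouped into blocks of 20 trials and the median category of each block is taken. The use of `median_low`, which always returns a category that actually occurred, is an assumption made here to stay within an ordinal scale. The data are hypothetical.

```python
from statistics import median_low


def block_medians(judgments, block_size=20):
    """Median judgment category for each successive block of trials."""
    return [
        median_low(judgments[start:start + block_size])
        for start in range(0, len(judgments), block_size)
    ]


# Hypothetical three-category data (1 = low, 2 = medium, 3 = high).
# Block 1 leans toward the low category, block 2 toward the high one.
block_one = [1] * 9 + [2] * 11
block_two = [2] * 11 + [3] * 9
medians = block_medians(block_one + block_two)
```

Although the distribution of judgments moves from the low end of the scale to the high end, both blocks have the same median, so the shift goes undetected.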

A fourth method of treating judg- 
ment data consists of computing the 
mean stimulus value to which a given 
judgment is applied under different 
conditions, e.g., under the usual con- 
ditions (unanchored) and under an- 
choring conditions. Shift can then be 
evaluated by computing the relevant 
statistics. For example, the unan- 
chored and anchored conditions may 
be compared by testing for the sig- 
nificance of difference between the 
means of the stimuli for the two con- 
ditions. As long as the stimuli being 
measured are physical in nature, con-
tinuous, and the application of num-
bers to the events (stimulus values)
can be specified in terms of physical 
operations equivalent to the statisti- 
cal manipulations, the problems of 
the first method mentioned can be 
successfully avoided. This method 
was employed by, among others, 
Tresselt and Becker (11) for the 
judgment of length of lines. The 
first disadvantage of this method ap- 
pears in the way these investigators 
utilized it. They compared the mean 
length of line characterized as me- 
dium under two different conditions, 
leaving out the same data for the 
lines characterized as long or short. 
This was done because the response 
medium was the most frequent one, 
with fewer long and short responses. 
In the usual absolute judgment situ- 
ation where quite often as many as 
nine (see Helson [4]) judgments are 
used by S, at least some of the stimu-
lus values, equivalent to a given re-
sponse, are indeterminate because
the response has not been employed 
by S. This situation becomes even 
more extreme in the “shifted” situa-
tion where the effect is such as to
cause elimination of some of the
judgments used in the “preshift”
condition. It becomes obvious that 
this method cannot be applied to all 
judgments because all are not always 
used, and even when all are used, 
they do not occur with equal fre- 
quency, thus resulting in different de- 
grees of reliability for the estimates 
of different judgments. What Tres- 
selt and Becker (11) did to get around 
this problem was simply to use that 
one response which had the largest 
frequency of stimulus values to esti- 
mate the judgment value. While 
this is a solution of a kind, it suffers 
from the fact that only some of the 
data can be used; furthermore, while 
the amount of data discarded for a 
three-point situation is not very
large, it increases as the number of
judgment categories increases. 
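
A short sketch may make the difficulty concrete. It computes the mean stimulus value to which each judgment is applied under two conditions and returns no value for a category that S did not use. The line lengths, category names, and function name are hypothetical and are not data from Tresselt and Becker.

```python
from collections import defaultdict
from statistics import mean


def mean_stimulus_per_judgment(trials, categories):
    """Mean stimulus value assigned to each judgment category.

    `trials` is a list of (stimulus_value, judgment) pairs. A category
    that was never used has no determinate value and is reported as None.
    """
    values = defaultdict(list)
    for stimulus, judgment in trials:
        values[judgment].append(stimulus)
    return {c: (mean(values[c]) if values[c] else None) for c in categories}


categories = ["short", "medium", "long"]
# Hypothetical line lengths and the judgments given to them.
unanchored = [(10, "short"), (12, "short"), (14, "medium"),
              (16, "medium"), (18, "long"), (20, "long")]
anchored = [(10, "short"), (12, "short"), (14, "short"),
            (16, "medium"), (18, "medium"), (20, "medium")]

before = mean_stimulus_per_judgment(unanchored, categories)
after = mean_stimulus_per_judgment(anchored, categories)
shift_in_medium = after["medium"] - before["medium"]
```

In this example only “short” and “medium” can be compared across conditions. “Long” drops out of use under anchoring, so the data bearing on that category contribute nothing to the estimate of shift.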

A fifth method applied to judg- 
ment data is that of graphing the re- 
sults in terms of percentages. Wever 
and Zener (12) plotted the percentage 
of times different categories of judg- 
ment were used by an S against the 
stimulus values to show that the 
method of absolute judgment yields
the same kind of results as the
method of comparative judgment.
Postman and Miller (9) presented
their data in a similar and perhaps
more revealing fashion. Placing the
judgment categories on the abscissa,
they plotted the cumulative percent-
age of the occurrence of each cate-
gory separately for each stimulus
value; thus they arrived at as many
curves for each condition as there
were stimuli presented. This pro-
cedure was followed for the shift and
preshift conditions so that these
could be compared to determine the
degree of shift. Presentation of the
distribution of judgments under dif-
ferent conditions (e.g., unanchored
and anchored conditions) yields an
excellent view of the phenomenon of
shift. When graphing must be done
for many Ss it becomes unwieldy; if
groups are to be compared some index
of relationship between the curves be-
comes necessary, and finally since
these curves are plotted in terms of
percentage a great number of presen-
tations becomes necessary for relia-
ble curves.

A sixth technique of treating the
method of single stimuli was origi-
nated by Mausner (6). He made a
frequency distribution of the judg-
ments of each S under two different
conditions of judgment (“A”: S
judging alone; “T”: two Ss judging
in the presence of each other). He
then took the difference in frequency
of occurrence between the two situa-
tions (“T” - “A”) for each category
of judgment, e.g., if the judgment
category “4” occurred 0 times in situ-
ation “T” and 18 times in situation
“A” the difference for that category
was -18 (“T-A” score). Then he
found the midpoint of the judgment
categories, e.g., for an S who uses
judgment categories 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, the midpoint
would fall exactly between judgment
categories 8 and 9. Algebraic sums
of the T-A scores were then ob-
tained separately for all the judgment
categories above the midpoint,
$\sum (T-A)_{\text{above}}$, and
separately for all the judgment cate-
gories below the midpoint,
$\sum (T-A)_{\text{below}}$. These two
sums were then totaled without
respect to sign to yield a shift score.

This method is based upon the fol-
lowing line of reasoning: The phe-
nomenon of shift manifests itself in
an increase in frequency of use of one
half of a scale of judgment and a cor-
responding decrease in frequency of
use of the other half of the scale. The
sum of the two subtotals,
$|\sum (T-A)_{\text{above}}| + |\sum (T-A)_{\text{below}}|$, is then
assumed to reflect the amount of shift.

This method of data treatment as- 
sumes rank order of the judgment 
categories but in using frequencies is 
free from the objections raised against 
the rating method. 
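
The shift score built from T-A frequency differences can be sketched in a few lines. The sketch assumes an even number of judgment categories, so that the midpoint falls between two categories as in the example of categories 3 through 14; how an odd number of categories should be split is not specified in the text and is not handled here. The frequencies and names are hypothetical.

```python
def shift_score(categories, freq_together, freq_alone):
    """Shift score from per-category frequency differences (T minus A).

    Sums the differences separately below and above the midpoint of the
    category scale, then adds the two subtotals without respect to sign.
    """
    if len(categories) % 2:
        raise ValueError("this sketch assumes an even number of categories")
    ordered = sorted(categories)
    half = len(ordered) // 2
    # Frequency of each category in situation T minus that in situation A.
    diff = {c: freq_together.get(c, 0) - freq_alone.get(c, 0) for c in ordered}
    below = sum(diff[c] for c in ordered[:half])
    above = sum(diff[c] for c in ordered[half:])
    return abs(above) + abs(below)


categories = list(range(3, 15))        # categories 3 to 14, midpoint 8 | 9
freq_alone = {4: 18, 8: 2}             # hypothetical counts, S judging alone
freq_together = {8: 5, 12: 15}         # hypothetical counts, two Ss together
score = shift_score(categories, freq_together, freq_alone)
```

Here category 4 contributes -18 and category 8 contributes +3 below the midpoint, category 12 contributes +15 above it, and the score is 15 + 15 = 30. Only frequencies and the rank order of categories enter the computation, and because signs are discarded the score does not by itself indicate the direction of shift.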

Mausner (6) derived still another
score which he named the score of
direction of shift. In this method, he
counted the number of plus and
minus signs of the T-A scores, re-
ferred to above, separately for the
judgment categories above and be-
low the midpoint. He took the dif-
ference in frequency between the
positive and negative T-A scores
separately above and below the mid-
point, e.g., if S uses 12 judgment
categories (there are 6 above and
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6 below the midpoint) and_ there 
are 5 negative T—A scores and 1 
positive T—A score below the mid- 
point, this will result in a difference 
score of 4 negative T—A scores; if 
there are 6 positive and no negative 
T—A_ scores above the midpoint, 
then we will obtain a difference score 
of 6 positive T—A scores. Remem- 
bering that the phenomenon of shift 
manifests itself in an increase in fre- 
quency of use of one half of a scale of 
judgment and a corresponding de- 
crease in frequency of use of the other 
half of the scale, we must add _ posi- 
tive T—A differences above the mid- 
point T—A differences 
below (negative ones above to posi- 
tive ones below the midpoint). If 
there is a preponderance of positive 


to negative 


l— A differences above the midpoint 
and/or a preponderance of negative 
I — A differences below the midpoint, 
then the direction of shift may be 
characterized as upward. This was 
true in the example given above since 
the 6 positive T—A differences above 
the midpoint must be added to the 4 
negative T—A differences below the 
midpoint to result in a direction of 
shift score of +10 (where +indicates 
an upward shift 
downward shift). 
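
The sign-counting rule can be written out as a short sketch. The function name is hypothetical, and the treatment of zero T−A scores (they are simply ignored) is an assumption, since the text does not say how ties are handled.

```python
def direction_of_shift(t_minus_a_above, t_minus_a_below):
    """Direction-of-shift score from the T-A scores of one S.

    Positive result: upward shift; negative result: downward shift.
    Zero T-A scores are ignored (an assumption, not stated in the text).
    """
    pos_above = sum(1 for d in t_minus_a_above if d > 0)
    neg_above = sum(1 for d in t_minus_a_above if d < 0)
    pos_below = sum(1 for d in t_minus_a_below if d > 0)
    neg_below = sum(1 for d in t_minus_a_below if d < 0)
    # Positive signs above and negative signs below both count as upward.
    return (pos_above - neg_above) + (neg_below - pos_below)


# Sign pattern of the worked example: 6 positive scores above the midpoint,
# 5 negative and 1 positive below it. The magnitudes are hypothetical.
ABOVE = [2, 1, 3, 1, 4, 2]
BELOW = [-3, -1, -2, -5, -1, 2]

if __name__ == "__main__":
    print(direction_of_shift(ABOVE, BELOW))
```

Only the signs enter the score, so the example yields 6 + 4 = +10 whatever magnitudes are chosen.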

It must be noted here that this method was designed by Mausner
because his degree-of-shift score does not indicate the direction
of shift. Usually such a score is not necessary. In addition, when
the number of judgment categories is small the amount of
discrimination possible between Ss is small. It must be noted that
since both methods rely ultimately upon a counting procedure, they
have the advantage of not being open to attack from the point of
view of the inequality of the distances between the categories.

The eighth and ninth techniques 
of treating absolute judgment data
were derived by the author for a 
weight-judgment technique applied 
to schizophrenics and normals (10). 
The first of these two techniques 
made use of the frequencies while the 
second made use of the physical scale 
of stimuli (weights in this case). 

The frequency method consisted, 
first of all, of tabulating the total 
number of judgments according to 
the categories: very heavy, heavy, 
medium, light, and very light for the 
unanchored and for the anchored 
conditions. If the anchor makes any 
difference in the judgments, the fre- 
quency of certain categories should 
decrease and that of others should in- 
crease. A heavy anchor would tend 
to decrease the frequencies of the 
heavier judgments and increase the 
frequencies of the lighter judgments, 
and vice versa for the light anchor. 
A comparison of the frequency dis- 
tributions by judgment categories 
was made between the unanchored 
and anchored conditions. This was 
carried out by using the Kolmogorov-Smirnov test (2). It involves
a comparison of the cumulative fre- 
quencies of judgments under the two 
conditions. The maximum discrep- 
ancy found between the two cumula- 
tive frequency distributions for each 
S is a measure of amount of shift. 
This score will be designated as the 
category-shift score. 

An example of the manner of calcu- 
lation of the category-shift score is 
given below: 

a. Tabulation of the frequency of judgments in each condition
(unanchored, heavy anchor, light anchor) separately for each S,
e.g., subject X,

        VL    L    M    H   VH
  NA     5    7    4    6    3
  HA     8   10    6    1    0

where VL=very light, L=light, M=medium, H=heavy, VH=very heavy;
NA=unanchored condition, HA=heavy anchor condition.

b. Cumulative frequency distribu- 
tion from VL to VH of the frequencies 
in each category for all conditions; 
following the above example the 
table under a would be transformed 
into the table below: 


        VL    L    M    H   VH
  NA     5   12   16   22   25
  HA     8   18   24   25   25


where the entry in each cell now rep- 
resents the frequency of judgments of 
the judgment category to which the 
cell refers plus all the frequencies of 
all the judgments lighter than the one 
under consideration. 

c. Subtract the cumulative frequencies of the appropriate NA from
the HA conditions (or of the LA, light anchor, condition from the
appropriate NA condition) and use the largest difference as the
shift score. Following the above example:

        VL    L    M     H   VH
  HA     8   18   24    25   25
  NA     5   12   16    22   25
  D      3    6   (8)    3    0


where D =difference between the HA 
and NA conditions and the number 
in parentheses represents the maxi- 
mum difference between the two 
cumulative frequency distributions. 
This difference is the category-shift 
score. If an investigator so desires, 
he can evaluate the statistical signifi- 
cance of the shift separately for each 
S. Goodman (2) provides a table for 
this purpose. If interested in com- 
paring groups one can use the maxi- 
mum difference between cumulative 
distributions (the category-shift score) 
as a score for each S. These scores 
which are frequencies can then be 
manipulated statistically. This tech- 
nique like all methods making use 
of frequency can be manipulated statistically without fear of dealing with
scales of unequal intervals as in the 
rating scale method. It has the ad- 
vantages of being simple in calcula- 
tion and of providing the investigator 
with an immediate estimate of sta- 
tistical significance. 
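
Steps a to c can be sketched in a few lines. The function name is hypothetical, and the frequencies are those of subject X as they can be read from the worked example, so they should be treated as an illustration and not as verified data.

```python
from itertools import accumulate


def category_shift(freq_unanchored, freq_anchored):
    """Category-shift score for one S under a heavy anchor.

    Both arguments list the judgment frequencies in the order
    VL, L, M, H, VH. The score is the largest difference between the
    two cumulative frequency distributions (anchored minus unanchored).
    For a light anchor the text reverses the subtraction, which amounts
    to swapping the two arguments.
    """
    cum_unanchored = list(accumulate(freq_unanchored))
    cum_anchored = list(accumulate(freq_anchored))
    differences = [a - u for a, u in zip(cum_anchored, cum_unanchored)]
    return max(differences, key=abs)


# Subject X as read from the example (VL, L, M, H, VH).
NA = [5, 7, 4, 6, 3]
HA = [8, 10, 6, 1, 0]

if __name__ == "__main__":
    print(category_shift(NA, HA))
```

For these frequencies the largest cumulative difference falls at the category M and equals 8, the category-shift score given in the text.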

The ninth method of computing 
the shift score for each S used the dif- 
ference between the stimulus (weight) 
to which a particular judgment was 
assigned and the one to which it 
should have been assigned (i.e., the 
correct weight) according to prior 
verbal instructions given to S. These 
differences were then summed sepa- 
rately for judgments assigned to 
weights heavier and weights lighter 
than the one to which they should 
have been assigned. The difference between these two sums resulted
in a separate score for the anchored and unanchored conditions; the
difference between the anchored and unanchored condition scores in
turn gave rise to the shift score which will be designated as the
stimulus-shift score.

An example of the manner of calculation of the stimulus-shift score
is given below:

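a. The responses (judgments) to each of the weights were tabulated
separately for each condition (NA, HA, and LA), e.g., for subject X.

[Tally tables for subject X under the NA and HA conditions: each
cell is indexed by stimulus weight in gms. and judgment category,
e.g., Cell 200-L.]
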
b. Inspection of the tally tables for the unanchored condition (NA)
and heavy anchor condition (HA) shows all the "correct" judgments
along the diagonal, that is, all the responses that have been
attributed to the stimuli to which Ss were instructed to attribute
them. Above the diagonals are responses that have
been attributed to stimuli lighter than
the ones to which Ss were previously 
verbally instructed to attribute them, 
while below the diagonal are the re- 
sponses that have been attributed to 
stimuli heavier than the ones to which 
Ss were previously instructed to at- 
tribute them. Thus it was possible 
to obtain two scores for each condi- 
tion, namely the number of responses 
attributed to stimuli heavier and the 
number attributed to stimuli lighter 
than the stimuli to which they should 
have been attributed according to 
previous instructions. To get a more 
exact measure of discrepancy be- 
tween judged and actual weight, the 
difference in grams between the ac- 
tual and the judged weight was ob- 
tained. 

For example, looking at the NA table above, the entry in Cell 200-L
gave rise to a discrepancy score of 50 gms., since the judgment
light was attributed to a weight 50 gms. lighter than the one to
which it should have been attributed according to previous
instructions; the same holds for the entries in the Cells 250-M and
350-VH. By adding these three discrepancy scores, a total
discrepancy score "lighter" of 150 gms. is obtained.

Below the diagonal, discrepancy scores can be obtained in an
analogous manner. Cell 250-VL gives rise to a discrepancy score of
50 gms. because the response VL was attributed to a weight 50 gms.
heavier than the one to which it should have been attributed
according to previous instructions. In Cell 300-L there are
three responses that have been at- 
tributed to the same stimulus that is 
50 gms. heavier than the one to which 
they should have been attributed 
according to previous instructions, 
thus giving rise to a discrepancy score 
of 150 gms. The Cell 400-M shows a 
discrepancy of 100 gms. because the 
response M was attributed to a 
weight 100 gms. heavier than the 
stimulus to which it should have been 
attributed according to previous in- 
structions. Finally, Cell 400-H 
shows a discrepancy score of 100 gms. 
because there are two responses that 
have been attributed to the same 
stimulus that is 50 gms. heavier than 
the one to which they should have 
been attributed according to previ- 
ous instructions. By adding the 
above four discrepancy scores a total 
discrepancy score ‘heavier’ of 400 
gms. is obtained. 


By means of the same procedure two total discrepancy scores can be
obtained from the HA table. The discrepancy score "lighter" is 50
gms. while the discrepancy score "heavier" is 1300 gms.

c. Subtract the total discrepancy score "lighter" from the total
discrepancy score "heavier" for each condition, thus obtaining an
estimate of the bias or net tendency to make errors in the
direction of attributing judgments to weights heavier or lighter
than the ones to which they should be attributed according to
instructions. In the example given, the net bias score for
condition NA is 400 gms. − 150 gms. = 250 gms., while the net bias
score for condition HA is 1300 gms. − 50 gms. = 1250 gms.

d. Finally, to obtain a shift score subtract the net score of NA
from that of HA (or of LA from NA). In this case the subtraction of
250 gms. from 1250 gms. yields a stimulus-shift score of 1000 gms.
for subject X.
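
Steps b to d can be sketched as below. Several things here are assumptions made only for illustration: the mapping of judgment categories to correct weights (VL=200, L=250, M=300, H=350, VH=400 gms.) is inferred from the cells discussed in the text, the off-diagonal counts are those implied by the discrepancy scores quoted for the NA table, and the function names are hypothetical. The HA tally itself is not available, so only its two totals (50 and 1300 gms.) are used.

```python
# Correct weight (gms.) for each judgment, inferred from the cells
# discussed in the text (e.g. Cell 200-L is 50 gms. too light for "L").
CORRECT_WEIGHT = {"VL": 200, "L": 250, "M": 300, "H": 350, "VH": 400}


def discrepancy_totals(tally):
    """Return (total_lighter, total_heavier) in gms.

    tally maps (stimulus weight in gms., judgment) to a response count.
    Entries on the diagonal contribute nothing.
    """
    lighter = heavier = 0
    for (weight, judgment), count in tally.items():
        gap = weight - CORRECT_WEIGHT[judgment]
        if gap < 0:
            lighter += -gap * count
        elif gap > 0:
            heavier += gap * count
    return lighter, heavier


def net_bias(lighter, heavier):
    """Net bias: total 'heavier' minus total 'lighter'."""
    return heavier - lighter


def stimulus_shift(net_unanchored, net_heavy_anchor):
    """Stimulus-shift score for a heavy anchor: HA net bias minus NA."""
    return net_heavy_anchor - net_unanchored


# Off-diagonal cells of the NA table as implied by the text.
NA_OFF_DIAGONAL = {
    (200, "L"): 1, (250, "M"): 1, (350, "VH"): 1,   # above the diagonal
    (250, "VL"): 1, (300, "L"): 3, (400, "M"): 1, (400, "H"): 2,
}

if __name__ == "__main__":
    na_lighter, na_heavier = discrepancy_totals(NA_OFF_DIAGONAL)
    na_bias = net_bias(na_lighter, na_heavier)
    ha_bias = net_bias(50, 1300)        # HA totals quoted in the text
    print(na_lighter, na_heavier, na_bias, stimulus_shift(na_bias, ha_bias))
```

Under these assumptions the sketch reproduces the figures of the example: 150 and 400 gms. for NA, a net bias of 250 gms., and a stimulus-shift score of 1000 gms.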
This last score made use of the physical scale underlying the
judgment scale. It is based on obtaining the difference in the
stimulus dimension under consideration, that is, between the
stimulus being judged at any given time and the stimulus to which
the judgment correctly belongs. This shift score takes into account
not only the direction of shift and the frequency with which
various categories are used but also considers the degree of shift,
i.e., how many gms. difference there is between the stimulus being
judged and the stimulus which S thinks he is judging.³

Since both the category-shift score and the stimulus-shift score
were computed on the same set of data, it was possible to compare
them. Table 1 provides us with the rank-order correlations between
the two types of shift scores, computed on 16 normals and 16
schizophrenics during different experimental sessions and due to
different anchors. Since the coefficients are high either score can
be substituted for the other. Further inspection of the scores
makes plain, however, that the stimulus-shift scores show greater
discrimination than the category-shift scores. Amount of
discrimination between Ss can be roughly measured in terms of the
number of tied scores for Ss. The total of such tied scores for the
weight-shift score over all Ss and conditions was 49 while that for
the category-shift score was 107.

³ A measure of kinesthetic sense can be established by adding the
total discrepancy scores "lighter" (above the diagonal) and
"heavier" (below the diagonal) obtained in computing the shift
measure; the smaller this total is, the better the kinesthetic
sense. Thus, this score is really an error score. Using the example
given for calculation of the shift measure, a kinesthetic score of
150 gms. + 400 gms. = 550 gms. is obtained for the NA condition and
a score of 50 gms. + 1300 gms. = 1350 gms. is obtained for the HA
condition.

TABLE 1

RANK-ORDER CORRELATION COEFFICIENTS BETWEEN TWO METHODS OF
MEASURING SHIFT FOR 16 EQUATED NORMALS AND 16 PATIENTS UNDER HEAVY
AND LIGHT ANCHOR CONDITIONS IN TWO SUCCESSIVE SESSIONS

  Session   Condition       Normals       Patients
  1         Heavy anchor    [illegible]   [illegible]
            Light anchor    .96**         .93**
  2         Heavy anchor    .92**         [illegible]
            Light anchor    .95**         [illegible]

  ** Significant beyond the .01 level.

In conclusion, it can be stated here that while some of the
criticisms (like those made against the direct use of ratings in
statistical manipulations based on assumptions of equal intervals)
would advise against any use of the method of computing a shift
score, most of the criticisms are of the nature of specifying under
what conditions a particular method might or might not show itself
up to advantage.

At our present stage of ignorance
about how genes determine behavior, 
we might well concentrate on experi- 
mental studies of lower organisms. 
Their reactions may be thought of as 
the emergent behavior which has de- 
veloped through evolution into the 
complex behaviors of higher organ- 
isms. Knowledge gained from such 
studies may provide conceptual mod- 
els leading to an understanding of 
how hereditary and stimulus compo- 
nents interact in determining higher 
forms of behavior. 

For this purpose the use of lower 
organisms offers distinct advantages. 
There is a brief time span between 
generations, permitting E to perform
in a short time period the various 
crossings essential to fundamental 
genetic studies. Each generation pro- 
duces abundant progeny, enabling 
E to recover the extreme behavior 
types required in selective breeding 
experiments. And further, the genet- 
ics of their morphology is better un- 
derstood than is that of higher forms. 
The fruit fly, Drosophila, has all of 
these advantages. 

First, however, reliable techniques 
for measuring individual differences 
(hereafter referred to as IDs) in be- 
havior must be developed. Reliabil- 
ity coefficients must be calculated, 
and they must be high. The problem 
reduces to the question: How can we
observe the behavior of large num- 
bers of very small Ss and at the same 
time reliably measure the perform- 
ance of each S? 

This paper presents a method 
which accomplishes both these objectives. We call it the method of
"mass screening with reliable individual measurement." As an
illustration of the method, we will show that
in the mass observation of a particu- 
lar behavior of Drosophila, reliability 
coefficients of about .9 can be se- 
cured in an experimental test period 
of four minutes. During this time 15 
sample observations of 15 sec. each 
were made. Each individual was ob- 
served as a member of a group of 
other flies. The method shows that 
Drosophila IDs can be measured as 
reliably as human IDs. Indeed, we
know of no experiment on men cover- 
ing 15 brief observations that yields 
a reliability as high as .9. 

Genetics has up to the present concerned itself with physical
characteristics rather than with behavior. The reliability of
individual measurement is not so obviously important in the study
of morphological characteristics; usually the characteristic is
either present or absent, or present in only a small number of
forms, and its presence or absence is immediately obvious (e.g.,
eye color, notched wing, bar eyes, etc.). Individual differences in
behavior, on the other hand, are not so easily recognized: such
recognition requires special methods.

There are at least three reasons why we need reliable measurement
of such IDs:

1. Reliable phenotypic differentia- 
tion is needed for selective breeding 
for homozygous lines. Both the pur- 
ity of different strains and the rapid- 
ity of selection are limited by our ca- 
pacity to discriminate between indi- 
viduals, since, as the errors of meas- 
urement decrease, the probability 
increases that individuals with the 
same score will be genetically similar. 

2. The study of learning also requires reliable individual
measurement because of the relation between the strength of the
unconditioned response and conditioning.² (Obviously for those
individuals in whom the unconditioned response has zero strength,
conditioning is impossible.) We believe that the study of learning
requires reliable knowledge of the distribution of IDs in the
population being sampled. Much effort has been spent in
demonstrating the influence of environment on behavior. It is
patent, however, that environmental influence must be an influence
on something and therefore the laws of such influence must differ
as the object influenced differs.

3. Reliable individual measurement is essential for answering three
questions about the generality of any behavior: (a) Temporal
generality; how long does a given disposition to respond endure and
to what extent does the rank ordering of individuals persist over
this period? (b) Stimulus generalization; over what range of
stimuli can the response be evoked and how well is the rank
ordering of individuals maintained over that range? (c) Behavior
generality; to what extent do other behaviors preserve the rank
ordering of individuals?

² Use is made of conditioned response terminology for convenience
of exposition. It is not intended to represent a theoretical
statement about the nature of the learning process.

Efficient methods of observation
are also a desideratum for studying
small organisms. It is a theorem in
sampling theory that the detection of
extreme cases, a necessity in genetic
selection experiments, requires the 
observation of large numbers of Ss 
since the probability of finding these 
extreme cases is a direct function of 
the sample size. Rapid observation 
permits the examination of large 
numbers of Ss and thus increases the 
sampling stability essential to the 
generality of the findings. Further- 
more, replication of experiments can 
be undertaken without excessive 
labor. 
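
The dependence on sample size can be made concrete with a small sketch. It assumes that Ss are sampled independently and that some fixed proportion of the population counts as "extreme"; the 1 per cent figure and the function name are hypothetical and are not taken from the paper.

```python
def prob_at_least_one_extreme(tail_proportion, sample_size):
    """Probability that a random sample contains at least one extreme S.

    Assumes independent draws, each extreme with probability
    tail_proportion.
    """
    return 1.0 - (1.0 - tail_proportion) ** sample_size


if __name__ == "__main__":
    # Hypothetical tail: 1 per cent of the population is "extreme".
    for n in (10, 106, 1000):
        print(n, round(prob_at_least_one_extreme(0.01, n), 3))
```

Under these assumptions a sample of about a hundred Ss has roughly a two-in-three chance of containing at least one such case, while a sample of a thousand makes it nearly certain.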

The next section of this paper pre- 
sents a method for reliably measuring 
IDs in behavior by means of mass 
screening, a procedure which achieves 
the objective of reliably classifying 
every individual's behavior without 
handling or observing each small organism individually. The method
is completely general and easily applicable
to the study of any behavior, both 
unconditioned and conditioned. 

This objective is illustrated by the 
results of an experiment that em- 
ployed the mass screening technique 
in the study of the geotropic reactions 
of Drosophila melanogaster. A series 
of 15 successive mass screenings, for 
example, produced 16 test tubes, each 
containing a different geotropic class 
of Drosophila. The flies in the tubes 
0 to 15 represent different degrees of 
the negative geotropism. That is, the 
flies are differentiated on this final 
composite 16-point scale based on 15 
prior mass screenings in which the 
individuals were not separately handled. The reliability coefficient of
this final scale score is determinable 
and in principle, it can be increased 
to any desired value by further mass 
screenings. 


EXPERIMENTAL DESIGN AND 
ANALYTIC PROCEDURES 

The method consists of cumulating a total composite score, X_t, for
each organism in any behavior, X, where:

$$X_t = X_1 + X_2 + X_3 + \cdots + X_n$$

and X_1, X_2, ..., X_n represent scores earned by it in n
comparable sample mass screenings. Setting up such a total score is
the essence of psychological test theory. Most of the formulae used
in this paper are standard in psychological test theory. A simple
summary of them can be found in J. P. Guilford's Psychometric
Methods, Chaps. 13, 14, 15 (1). Guilford's rationale of the
formulae, however, is based on the factorial truth-error doctrine.
In another paper one of the authors develops them with fewer
assumptions (3). Our procedure adapts these principles to the
problem of calculating reliability coefficients for the scores of
individuals who are only observed as members of a large group.

The main steps of the procedure 
are as follows: 

1. Conceptualize the behavior property, X, that is to be scaled,
and operationally define it with sufficient specification to
indicate the general conditions under which it may be observed.

2. Devise a standard test sample procedure for obtaining a unit
measure of IDs in X, one which has the advantage of permitting
observation of a large group of Ss at one time while locating the
total, N, of individuals in subgroup classes scored 0, 1, 2, ..., k
in magnitudes of X.

3. Take a randomly bred sample of the Ss and mass screen them
through n replications of the standard procedure. At the end of
every replication, score each subgroup by its cumulative total
score, X_t, then combine subgroups with the same X_t score and
proceed with the next replication.

4. Calculate the reliability coefficient, r_tt, of each successive
X_t score, decide on the value of n which will yield a reliability
of sufficiently high magnitude, then examine the shape of the
distribution of the X_t scores of the individuals.

5. If the original method results in a low reliability or an
excessively skewed distribution of final composite scores, alter
the standard test, take a second random sample and repeat the
general procedure. Several such experiments may be required before
an adequate method of observation is discovered.

The details of the steps of this general procedure will be
developed and illustrated by an experiment conducted by one of the
authors on IDs in the geotropic reaction of Drosophila.


1. Conceptualization and Definition of the Behavior


The behavior chosen was the un- 
conditioned disposition to go in the 
direction opposite to gravity. This 
negative geotropism is operationally 
defined as an upward movement of the 
fly whenever it is placed in any situ- 
ation permitting travel upward, other 
external stimuli which might induce 
vertical movement being controlled. 
2. Standard Test Sample Procedure 

The test situation consists of two test tubes, a lower one standing
upright in a rack, the other inverted over the mouth of the lower
one. Since the flies are also phototropic, the light source was
placed at right angles to the vertical. A group of flies are placed
in the lower tube, shaken to the bottom, and then allowed to
ascend. At the end of an arbitrary "cutting point" time of 15 sec.,
a card is inserted between the lower and the upper tubes. The upper
tube is scored and labeled "1," and the lower tube "0."


Thus the standard sample observation in this case is like a
dichotomous test item, the top tube scored "pass" and the lower one
"fail." A cutoff point of 15 sec. was found empirically to divide
the group of flies into two approximately equal pass and fail
subgroups, a division which avoids skewness in the distribution of
final composite X_t scores.

It should be emphasized that dichotomous scoring is not a necessary
restriction of the method. The standard procedure could have been
devised to provide more classes. The pass-fail break was chosen for
experimental convenience.

This standard test procedure, though satisfying the operational
definition of geotropism, might not elicit uniquely a systematic
reaction to gravity. Since the test tube situation permits only
movement upward it may be that, if there is an activity
differential among the Ss, the flies that are upwardly mobile may
be very active flies. Only additional experiments which control
activity can resolve the matter. Thus, we use the term "geotropism"
here only in an operational sense, recognizing that the IDs
observed in this situation might later be shown to be significantly
influenced by additional components.


3. Choice of an Unselected Sample 


Since the range and reliability of 
IDs is partly a function of the heterogeneity of the Ss, a stock of
unselected Drosophila with a history of
random mating was chosen. 


4. Mass Screening 


A random sample of 106 flies was screened and scored by the
following procedure.

a. First composite score, X_t1 = X_1. The results of the first
observation are shown in Fig. 1, which reproduces part of the score
sheet actually used. Under X_1 and f_1 it can be seen that 54 flies
ascended to the upper tube, earned a "pass" and thus received a
score of X_1 = 1. There are 52 flies that remained in the lower
tube, earned a "fail" and received a score of X_1 = 0. The scores,
X_t1, of this trial take the values of 1 and 0.

b. Second composite score, X_t2 = X_t1 + X_2. The 54 flies with
X_t1 = 1 were put through the standard procedure a second time for
Trial 2. The 46 flies that ascended earn a tube score, X_2 = 1, and
a composite score X_t2 = 2; the 8 remaining down have X_2 = 0 and
X_t2 = 1, as shown. In similar fashion the flies with X_t1 = 0
divide into 22 earning X_2 = 1, X_t2 = 1 and 30 earning X_2 = 0,
X_t2 = 0.

c. Third composite score, X_t3 = X_t2 + X_3. The standard procedure
is repeated for each of the three X_t2 classes resulting from
Trial 2.

Note, even though there are four X_2 tubes of flies at the end of
Trial 2, there are only three X_t2 classes. The two subgroups with
8 and 22 flies have been combined in one tube because both received
the same score, X_t2 = 1, i.e., the same composite score is the
cumulative sum of all previous scores irrespective of the order in
which the individual "passes" and "fails" were obtained.

d. Additional composite scores, X_t4, ..., X_t15. The procedure is
continued by taking further sample observations; at the end of each
one, subgroups having the same X_t score are combined for the next
observation. Figure 1 shows the results schematically up through
X_t10.
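
The bookkeeping of steps a to d can be sketched as follows. The function name and data layout are hypothetical; the counts are those reported for Trials 1 and 2. The sketch only tracks how many flies fall in each composite class, which mirrors the fact that individual score histories are not retained.

```python
from collections import Counter


def screen_once(classes, passes_by_class):
    """Apply one standard test to every composite-score class.

    classes maps composite score -> number of flies in that tube.
    passes_by_class maps composite score -> number of those flies that
    ascended ("pass") on this trial. Flies that pass move up one class;
    subgroups arriving at the same composite score are pooled.
    """
    new_classes = Counter()
    for score, n_flies in classes.items():
        n_pass = passes_by_class[score]
        if not 0 <= n_pass <= n_flies:
            raise ValueError("passes must lie between 0 and the tube size")
        new_classes[score + 1] += n_pass
        new_classes[score] += n_flies - n_pass
    return new_classes


# Trials 1 and 2 of the experiment described in the text.
START = Counter({0: 106})
AFTER_TRIAL_1 = screen_once(START, {0: 54})
AFTER_TRIAL_2 = screen_once(AFTER_TRIAL_1, {1: 46, 0: 22})

if __name__ == "__main__":
    print(sorted(AFTER_TRIAL_1.items()))
    print(sorted(AFTER_TRIAL_2.items()))
```

After Trial 2 the four tubes collapse into three classes: 30 flies with a composite score of 0, 30 with a score of 1 (the pooled subgroups of 8 and 22), and 46 with a score of 2.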

The reason for the “experimental 
convenience” of dichotomous classes 
in the standard procedure should now 
be apparent; with more than two 
classes the number of subgroups be- 
comes unmanageable. 


5. Analysis 


a. The distribution of X_t scores. One of the objectives of
experimental behavior genetics is reliable differentiation between
individuals and
subsequent genetic validation of dif- 
ferences by means of selective breed- 
ing. Since, for a given behavior, it is 
assumed that there is a range of abil- 
ity and that the Ss in a population 
are distributed over the range, it fol- 
lows that any methods which tend to 
pile up the final scores in a few ex- 
treme categories should be eschewed 
in favor of others which distribute the 
scores more widely. The individuals 
whose behavior is under observation
will be used for breeding, hence it is 
important to differentiate them 
clearly on the behavioral scale. Fail- 
ure to do this prevents the discovery 
of any genotypic differences that 
might exist. 

The E can usually control the form of the distribution of total
X_t scores. In our illustrative experiment this control was
accomplished through selection of the time interval in which the
response can be performed, i.e., the proportions p of "passes" and
q of "fails" vary as a function of the amount of time allowed in
the test tube. In examples from several experiments it may be shown
that when p > .5, the X_t distribution is negatively skewed and
when p < .5, X_t is positively skewed. Either type of skewness is
undesirable because cases pile up in the extreme categories where,
for the purposes of selective breeding, the finest differentiations
are needed.

This point is illustrated in Table 1 where the frequency
distribution of the composite score X_t10 from Fig. 1 is presented
in the first row of entries. A 15-sec. cutoff was used for this
sample. The mean proportion earning a score of X_i = 1 on the ten
successive standard tests is p̄ = .5. The distribution is seen to be
platykurtic with no appreciable piling up of the cases in the
extreme categories. This is the result of the approximately 50-50
cut on each trial.

The effects of extreme cuts are shown in the other rows of Table 1.
For the group with an 8-sec. cutoff in the standard test the
proportion getting into the upper tube is p̄ = .16, with the result
that the composite X_t10 scores are very positively skewed with a
pile up of flies in the 0 category. The opposite extreme cut of 27
sec. gives a p̄ = .66, with a pile up at the high X_t10 scores.

TABLE 1

DISTRIBUTION OF INDIVIDUALS IN COMPOSITE X_t SCORE

(Entries are frequencies, by cutoff; the body of the table is not
recoverable from the source.)

b. Reliability of X_t scores. It is important that the composite
X_t score be reliable if E is to use the differentiations between
individuals as the basis for further experimental work on selective
breeding, conditioning, or the investigation of the generality of
behavior X. The reliability coefficient, r_tt, cannot be computed
by the split-half method in the mass screening method because
combining into a single group all Ss with the same composite X_t
score loses the specific sample score history of each individual.
The coefficient can be estimated accurately, however, from the
variances of the composite X_t score and of the individual test
sample scores, as follows (3):

$$r_{tt} = \frac{n}{n-1}\left(1 - \frac{\sum V_i}{V_t}\right) \qquad [1]$$

where:

n = number of standard test samples or replications.

ΣV_i = sum of the variances (σ_i²) of the test samples.

V_t = variance of the final composite X_t scores, i.e., σ_t².

When, as in the present case, the standard procedure gives a
dichotomous cut, the variance, V_i, of any particular sample
observation is:

$$V_i = pq,$$

where:

p = proportion of individuals above the cut in all subgroups
  = mean score when, as in the example, those above the cut are
    scored 1, those below 0.

q = 1 − p.
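
As an illustration, Equation 1 can be evaluated from nothing more than the pass proportions of each trial and the frequency distribution of the composite scores. The sketch below applies it to the first two trials of the experiment described earlier (n = 2), using a variance computed over all N individuals; the function names are hypothetical, and the two-trial value is shown only to make the arithmetic concrete, not as a result reported in the paper.

```python
def composite_variance(freq_by_score):
    """Variance V_t of the composite scores (dividing by N)."""
    n_total = sum(freq_by_score.values())
    mean = sum(s * f for s, f in freq_by_score.items()) / n_total
    return sum(f * (s - mean) ** 2 for s, f in freq_by_score.items()) / n_total


def reliability(pass_proportions, freq_by_score):
    """Reliability r_tt of the composite from Equation 1.

    pass_proportions holds p for each of the n dichotomous trials,
    so each trial variance is V_i = p * (1 - p). Requires n >= 2.
    """
    n = len(pass_proportions)
    sum_v_i = sum(p * (1.0 - p) for p in pass_proportions)
    v_t = composite_variance(freq_by_score)
    return (n / (n - 1)) * (1.0 - sum_v_i / v_t)


# First two trials: 54 of 106 passed Trial 1; 46 + 22 = 68 passed Trial 2.
PASS_PROPORTIONS = [54 / 106, 68 / 106]
COMPOSITE_AFTER_TRIAL_2 = {0: 30, 1: 30, 2: 46}

if __name__ == "__main__":
    print(round(reliability(PASS_PROPORTIONS, COMPOSITE_AFTER_TRIAL_2), 3))
```

Because only the class frequencies are needed, the coefficient can be recomputed after every replication without knowing any individual fly's history.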


The values of the reliability coeffi- 
cients and of other constants for sev- 
eral Drosophila experiments are given 
in the third rows of Table 2. The first 
group is the one presented in Fig. 1, 
in which 15 sample observations were 
finally taken under conditions be- 
lieved to produce optimum differenti- 
ation between individuals. It will be 
noted that, beginning with the fourth 
column of entries, after the first few 
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“adjustment” trials the reliabilities 
progressively increased to .87 for the 
final composite score based on 15 
sample observations. 

The E naturally asks: are the successive
sample observations strictly
comparable measures of the property 
X, here the negative geotropic reac- 
tion? The additional constants of 
Table 2 give insight into this ques- 
tion. 

If the individuals systematically
improve or deteriorate in performance,
the mean score, $p_i$, and the variance,
$V_i = pq$, of successive observations
will both change. In the first
and second rows of Table 2 we see
that in our example $p_i$ and therefore
$V_i$ both remain relatively constant.

If the individuals become either 
more reliably differentiated or less so 
as screening proceeds, then the reli- 
ability coefficient will not increase ac- 
cording to the “Spearman-Brown 
law” of increased reliability with the 
addition of comparable sample ob- 
servations. Evidence on this point 
can be secured in two ways. 


TABLE 2

RELIABILITY COEFFICIENTS AND OTHER CONSTANTS IN THE DROSOPHILA GEOTROPIC EXPERIMENTS

SAMPLE OBSERVATION, $X_i$
The first is to discover whether the
mean correlation, $\bar{r}_{ij}$, between sample
observations entering into the composite,
$X_t$, changes for successive $X_t$
scores. From the familiar Spearman-Brown
approximation (3, Formula 17), we note
that the reliability coefficient, $r_{tt}$,
for any composite based on n samples is:

$$r_{tt} = \frac{n\,\bar{r}_{ij}}{1 + (n-1)\,\bar{r}_{ij}}, \qquad [3]$$

whence, solving for $\bar{r}_{ij}$:

$$\bar{r}_{ij} = \frac{r_{tt}}{n - (n-1)\,r_{tt}}. \qquad [4]$$


The successive values of $\bar{r}_{ij}$ are
given in Table 2, fifth rows. We note
that after the first few trials $\bar{r}_{ij}$
plateaus around .30.
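
The relation between Equations 3 and 4 can be checked numerically. The following is a minimal sketch, with function names that are assumptions of the illustration; it uses the reliability of .87 (more exactly .867) reported in the text for the composite of 15 sample observations.

```python
def spearman_brown(mean_r, n):
    """Equation 3: reliability of a composite of n comparable samples."""
    return n * mean_r / (1 + (n - 1) * mean_r)


def mean_intersample_r(r_tt, n):
    """Equation 4: mean correlation between samples implied by r_tt."""
    return r_tt / (n - (n - 1) * r_tt)


r_bar = mean_intersample_r(0.867, 15)
print(round(r_bar, 2))                      # mean inter-sample correlation
print(round(spearman_brown(r_bar, 15), 3))  # recovers the starting reliability
```

The value obtained for 15 samples agrees with the plateau of about .30 noted above.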

The other way is for E to set a desired
reliability for the final composite,
and solve for the value of n in
Equation 3 that will achieve this desired
reliability. Suppose E desires a
reliability of .95. Call this $R_{tt}$. Set
$R_{tt}$ into Equation 3 and solve for n:

$$n = \frac{R_{tt}\,(1 - \bar{r}_{ij})}{\bar{r}_{ij}\,(1 - R_{tt})}. \qquad [5]$$

The values of n for $R_{tt}$ = .95 are given
in the fourth rows of Table 2. In general
they remain around 45 trials.
This finding has the practical value
of informing E how many sample
trials are necessary to achieve the
reliability he desires. If n turns out to
be too large, a design having more
classes per trial might be considered
as a means of reducing the number of
trials required.
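
The number of trials required for a desired reliability follows directly from the Spearman-Brown relation. A short sketch is given here under the assumption of a mean inter-sample correlation of .30, the plateau value reported above; the function name is hypothetical.

```python
import math


def samples_needed(target_reliability, mean_r):
    """Number of comparable samples n giving the target composite reliability."""
    return target_reliability * (1 - mean_r) / (mean_r * (1 - target_reliability))


n_exact = samples_needed(0.95, 0.30)
n_trials = math.ceil(n_exact)  # round up to a whole number of trials
print(round(n_exact, 1), n_trials)
```

Rounding up is a choice of the illustration: a fractional number of trials cannot be run, and rounding down would fall short of the desired reliability.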

When the individual test sample
scores are not available, as is the
case when groups are screened on the
multiple-unit discrimination maze
(2), the reliability coefficient can be
computed directly from the final distribution
of $X_t$ scores by means of
the Total Score formula (3):

$$r_{tt} = \frac{n}{n-1}\left[1 - \frac{M_t - M_t^2/n}{V_t}\right], \qquad [6]$$

where $M_t$ equals the mean of the final
composite $X_t$ scores.
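
A minimal sketch of the Total Score computation of Equation 6 is given below. The composite scores are hypothetical, the function name is an assumption, and the use of the population variance (division by N) for the composite variance is also an assumption, since the text does not state which variance is intended.

```python
from statistics import mean, pvariance


def total_score_reliability(composite_scores, n):
    """Equation 6: reliability from the mean and variance of composite scores alone."""
    m_t = mean(composite_scores)
    v_t = pvariance(composite_scores)  # assumption: population variance
    return n / (n - 1) * (1 - (m_t - m_t ** 2 / n) / v_t)


# Hypothetical composite scores on n = 15 dichotomous samples.
scores = [0, 15, 15, 0, 7, 8]
print(round(total_score_reliability(scores, 15), 3))
```

When every sample has the same p, the term $M_t - M_t^2/n$ equals $npq$, the sum of the sample variances, so that Equation 6 then gives the same value as the formula based on the separate sample variances.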

c. Domain validity coefficient of the
composite score, $X_t$. The reliability
coefficient, $r_{tt}$, though necessary in
the above formulations, is not the
best statement of the reliability of
the composite $X_t$. A more meaningful
index is the correlation between the
$X_t$ scores and that on an indefinitely
large number of screenings, namely
$X_\infty$. Though the “true score,” $X_\infty$, is
not available, the correlation $r_{t\infty}$ can
nevertheless be estimated as follows (3):

$$r_{t\infty} = \sqrt{r_{tt}}. \qquad [7]$$


Thus, in our case our $X_t$ based on
fifteen screenings would correlate
$r_{t\infty} = \sqrt{.867} = .93$ with a perfectly
reliable measure based on many such
screenings. This coefficient also has
the following added meaning: If we
had the true score of each fly based
on many sets of 15 screenings, the
ratio of the standard deviation of
these true scores to that of the observed
$X_t$ score would be .93. In
short, the distribution of true scores
would look much like that actually
observed.

d. Individual variance (‘errors of
measurement’). In order to conduct
experiments on selective breeding,
conditioning, or generality it is necessary
to get a practical estimate of
the amount of difference in $X_t$ scores
among individuals that is undetermined,
i.e., not assignable to known
sources of variation. This estimate is
the individual variance, $V_e$ (3, Formula
23a), where:

$$V_e = V_t(1 - r_{tt}). \qquad [8]$$

In our example for $X_{15}$, $\sigma_t = 4.40$,
hence the individual standard deviation
is:

$$\sigma_e = 4.40\sqrt{1 - .867} = 1.6.$$


The necessity of a high reliability 
can be seen in the above formula: as 
the reliability approaches unity the 
amount of variation attributable to 
individual variance tends to vanish. 
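
The two figures quoted above, the domain validity of .93 and the individual standard deviation of 1.6, can be reproduced as a check. The sketch below assumes that 4.40 is the standard deviation of the composite scores, which is the reading under which the arithmetic in the text comes out; the function names are hypothetical.

```python
import math


def domain_validity(r_tt):
    """Equation 7: correlation of the composite with the true score."""
    return math.sqrt(r_tt)


def individual_sd(sigma_t, r_tt):
    """Square root of Equation 8: standard deviation due to individual variance."""
    return sigma_t * math.sqrt(1 - r_tt)


print(round(domain_validity(0.867), 2))
print(round(individual_sd(4.40, 0.867), 1))
# As the reliability approaches unity, the individual standard deviation vanishes.
print(individual_sd(4.40, 1.0))
```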

Nonuniformity of individual variance.
The individual variation, however,
is most likely not constant over
the final distribution: (a) an extreme
score can vary in only one direction,
towards the mean; (b) the individuals
receiving extreme scores have
shown perfectly consistent performance
throughout, that is, either they
have always scored a zero or they
have always scored one. Hence, it
might be expected that the individual
variation, as estimated by a retest,
should be much smaller at the extremes
than in the middle of the distribution.

Empirical check. To assess this possibility
a retest or validation experiment
may be performed. In our illustration,
the Ss receiving extreme
$X_t$ scores of 15 and 14 were combined
and put through n′ = 10 additional
trials; also those receiving middle
$X_t$ scores of 7 and 8 were put through
a retest of 10 trials. For the extreme
categories $\sigma_{t'}^2$ = 4.00, while for the
middle categories $\sigma_{t'}^2$ = 5.82, the latter
being significantly larger than the
predicted variance for the middle
categories. It is evident that the assumption
of uniformity of individual
variance over the whole $X_t$ scale is
doubtful.


LIMITS OF SELECTIVE BREEDING 


How many generations is it necessary
or practical to continue a selective
breeding program, i.e., what are
the criteria for stopping? The individual
standard deviation, $\sigma_e = \sqrt{V_e}$,
provides an answer to this question:
it is useless to attempt further selection
in any line beyond the point
where its $\sigma_t = \sigma_e$; at that point the
method of observation no longer reliably
differentiates individuals, i.e.,
neither selection nor the evaluation
of the results of selection is any
longer possible. In our case, no further
selective breeding would be attempted
in any line whose $\sigma_t$ was
much below 1.6.


SUMMARY 


Fast breeding, prolific, small or- 
ganisms are pre-eminently suited for 
studies in the field of behavior genet- 
ics. Their value as experimental Ss 
is further enhanced by the method of 
mass screening that succeeds in com- 
bining the objective of reliable indi- 
vidual measurement with that of 
mass observation. Hence, it is now 
possible to achieve the experimental 
desiderata of efficiency, reliability, 
and brevity in the field of behavior 
genetics. The method is illustrated 
by experiments on the geotropic re- 
sponses of Drosophila. 